I.  INTRODUCTION

The first Reduced Instruction Set Computer (RISC)
appeared at the end of the 1970s, and since then long and
heated discussions have taken place in the computer
architecture community. These discussions centered on the
validity of the claims made by RISC proponents regarding
the performance achieved by the proposed machines when
compared to traditional computers, which are referred to as
Complex Instruction Set Computers (CISC).

Due  to  a  lack  of  an  appropriate  method  to  evaluate  the 
performance  effects  of  various  architectural  features,  it  is 
difficult  to  resolve  the  RISC/CISC  controversy. 

The  interest  in  the  ideas  proposed  by  this  philosophy 
has  been  growing,  and  presently  many  of  the  major  computer 
companies  are  investing  a  great  deal  in  this  new  type  of 
computer  architecture. 

This thesis tries, first, to define the basic characteristics
of a Reduced Instruction Set Computer, so that it is
possible to focus on the specific architectural features
peculiar to RISC machines.

The approach that, in the author's opinion, must be
followed in order to evaluate computer performance is then
presented, together with the author's disagreement with the
approach taken in several published comparisons between RISC
and CISC machines.

A model for computer performance evaluation is
suggested. This model is composed of two parts. The first
part deals with the timing analysis of computer performance.
The second part sets a criterion to determine the
efficiency of a given computer control unit when used for a
given application. Finally, in order to evaluate the model,
an example is given demonstrating the quantification of the
performance effects of an architectural enhancement to a
system architecture.

The model suggested for computer performance evaluation
constitutes a departure from current computer performance
evaluation methods, because attention is centered
on the computer architecture rather than on measurements
of throughput, response time and mean job turnaround time,
where the main emphasis of the evaluation process is put on
the software.

The model is intended to provide a tool for computer
architects to use, so that discussions regarding the
performance achievements of certain architectural features
might be quantified and rational conclusions may be reached.

II. WHAT IS A RISC?

A.   INTRODUCTION 

In  recent  years  a  new  type  of  computer  architecture  has 
received  a  great  deal  of  attention. 

This  new  architecture  is  mainly  the  result  of  an  effort 
conducted  in  an  academic  environment.  Profiting  from  the  new 
possibilities  that  custom  VLSI  offers,  the  professors  and 
students  at  the  University  of  California  at  Berkeley, 
collaborating  in  several  courses  in  this  area,  began 
projects  on  building  single  chip  computers. 

Due  to  limitations  of  the  chip  area,  available  tools  and 
the  available  time  for  the  completion  of  the  project, 
several  simplifications  to  contemporary  architectures  were 
made.  For  example,  the  instruction  set  was  simplified  by 
eliminating  all  instructions  that  might  be  called  composite 
instructions. This type of instruction is equivalent, in the
operation performed, to a sequence of other more elementary
(atomized) instructions.

A claim has been made that the performance obtained
from these machines was unexpectedly high, and this
triggered a major discussion on the merits of
RISCs.

Feeding the controversy is undoubtedly the lack of an
appropriate method or tool to measure computer architecture
performance and the effects of a particular architecture
modification on computer performance.

From the very beginning the RISC machines were related
to implementation issues in the use of VLSI technology.

Proponents called the approach "RISC", for Reduced
Instruction Set Computers, as opposed to the traditional
computers, which they referred to as "CISCs", for Complex
Instruction Set Computers.

The "new architecture" proponents did not present it as a
proposal to enhance, in some way, the prevailing architecture,
but as a complete departure from previous work.

No precise definition has ever been given for the
complete characteristics of a RISC machine, and because of
that, there are now in existence several different machines
all claiming to be RISCs. Although there are some common
features, there is no clear-cut agreement on what comprises a
reduced instruction set computer.

No doubt some very valid ideas were brought to the
computer architecture environment by the "RISC philosophy
proponents", but, nevertheless, it is a definite risk
to accept a new idea without an open, substantive debate
where the benefits are separated from the jargon.

The  first  step  in  understanding  and  identifying  the  RISC 
trade-off  is  a  more  precise  definition  of  RISC. 

As stated above, several implementations of RISCs are
already in existence, and, of these, four are undoubtedly
important enough to be mentioned.

They are:

1) The RISC I and II, developed at the University of
California at Berkeley

2) The 801 Minicomputer, developed at the IBM Thomas J.
Watson Research Center

3) The MIPS, developed at Stanford University.

In order to develop a definition of "RISC", the
existing RISCs should be studied.

B.   THE  RISC  I  AND  II 

The RISC I and II were both developed at the University
of California at Berkeley, where the acronym RISC originated.
Since both were developed at U.C. Berkeley, they are very
similar in their composition. In fact, RISC II is no more
than an enhanced version of RISC I.

Both are single chip VLSI processors having the
following characteristics:

1) They are 32-bit machines. That is, all registers and
buses are 32 bits wide.

2)  Instruction  Set: 

2a)  RISC  I  has  31  instructions 
RISC  II  has  39  instructions 

2b)  Both  have  a  load/store  architecture.  This  means 
that  all  instructions  except  load  and  store  are 
register-to-register.  Load  and  store  are  the  only 
memory-reference  instructions. 

2c)  All  instructions  except  LOAD  and  STORE  are  single- 
cycle  where  a  cycle  is  the  time  it  takes  to  read 
and  add  two  registers,  and  then  store  the  result 
back  into  a  register. 

2d)  All   instructions  are   the   same   size  (32   bits). 
There  are  two  different  formats  but  the  fields  are 
at  fixed  locations. 

2e) Addressing Modes:

There are two addressing modes: one for
register-to-register instructions (Register Direct) and the
other for memory-reference instructions (Index +
Displacement).

3) Registers

3a) Total number of on-chip registers

RISC I   138

RISC II  198

3b)  The  processor  is  organized  in  multiple  overlapping 
windows  in  order  to  facilitate  parameter  passing 
between  procedures. 

The  windows  are  organized  in  a  circular  buffer 
fashion.  In  the  case  that  the  nested  procedure 
depth  is  greater  than  the  number  of  windows  minus 
one,  the  values  in  the  window  corresponding  to  the 
oldest  procedure  are  stored  in  memory  and  this 
window  is  then  free  to  be  allocated  to  the  current 
procedure.  At  any  time  32  registers  are  visible 
constituting  what  is  called  the  "current  window". 
All  windows  have  a  fixed  size  and  the  composition 
shown in Figure 2.1.

The global registers are common to all procedures,
and therefore they are used to store global variables.
Register R0 holds a fixed value of zero.
The low registers are common to the current procedure
and to the called procedure, although, in the
called procedure, they will have a different
number since there they constitute the high registers
of the corresponding window. The high registers
are common to the current procedure and to
the calling procedure. The high and low registers
along with the global registers constitute the
overlapped part of each window and are used for
parameter passing between procedures. The local
registers are only visible in the current window.

4)  The  control  unit  is  hardwired  with  most  of  its  logic 
implemented  using  PLA's. 

5) Pipeline Stages
The RISC I has two pipeline stages, i.e., depending
on the program sequence it can prefetch the next
instruction while it executes the present
instruction. The RISC II has three pipeline stages,
i.e., depending on the program sequence it can
prefetch the next instruction and store the final
results of the previous instruction in a register,
while it executes the present instruction.

Figure 2.1  RISC Register Window.

6) Use of Delayed Branch

In order to increase speed and not to discard the
prefetched instruction when a branch instruction is
executed, the branch takes place only after the
execution of the next sequential instruction.
Typically the compiler arranges for the instruction
following the branch to be part of the loop; see
[Ref. 1].

8) Implementation

RISC I is implemented with 4 micron NMOS VLSI
technology with a clock of 8 MHz and a cycle of 500
ns. RISC II is implemented with 3 micron NMOS VLSI
technology with a clock of 12 MHz and a cycle of 330
ns.

9) Both RISC I and II have no floating-point support.
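
The overlapped register windows described in item 3b can be sketched as a mapping from a (window, logical register) pair to a physical register. This is only an illustrative sketch: the split of the 32 visible registers into 10 global, 6 low, 10 local and 6 high registers, the choice of 8 windows, and the convention that a call moves from window w to window w+1 are assumptions made here (the composition in Figure 2.1 is not reproduced), chosen so that the total comes to the 138 registers quoted above for RISC I. The names physical_index and windows_spilled are hypothetical.

```python
# Assumed layout (not taken from Figure 2.1): 10 global, 6 low, 10 local and
# 6 high registers per 32-register window, with 8 windows in total.
N_GLOBAL, N_LOW, N_LOCAL, N_HIGH = 10, 6, 10, 6
N_WINDOWS = 8
STRIDE = N_HIGH + N_LOCAL        # physical registers contributed by each window
RING = N_WINDOWS * STRIDE        # size of the circular windowed register file


def physical_index(window: int, reg: int) -> int:
    """Map logical register `reg` (0..31) of `window` to a physical register."""
    if not 0 <= reg < 32:
        raise ValueError("logical register must be in 0..31")
    if reg < N_GLOBAL:                        # global: shared by every window
        return reg
    if reg < N_GLOBAL + N_LOW:                # low: shared with the called procedure
        offset = STRIDE + (reg - N_GLOBAL)
    elif reg < N_GLOBAL + N_LOW + N_LOCAL:    # local: private to this window
        offset = N_HIGH + (reg - N_GLOBAL - N_LOW)
    else:                                     # high: shared with the calling procedure
        offset = reg - (N_GLOBAL + N_LOW + N_LOCAL)
    # Modulo arithmetic gives the circular-buffer organization of the windows.
    return N_GLOBAL + (window * STRIDE + offset) % RING


def windows_spilled(call_depth: int) -> int:
    """Windows that must be saved to memory once depth exceeds N_WINDOWS - 1."""
    return max(0, call_depth - (N_WINDOWS - 1))


caller, callee = 0, 1
# The caller's first low register is the same physical register as the
# callee's first high register, which is how parameters are passed.
assert physical_index(caller, 10) == physical_index(callee, 26)

total = len({physical_index(w, r) for w in range(N_WINDOWS) for r in range(32)})
print(total)  # 138 under the assumed layout
```

Under these assumptions no register is copied on a procedure call; only the window number changes, and memory traffic occurs only when the nesting depth exceeds the number of windows minus one.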

C.   THE  801  MINICOMPUTER 

Developed by IBM at the Yorktown Heights Research Center
from 1975 until 1979, it was the first machine to follow
what later would be called "The RISC Approach to Computer
Architecture".

Due to its proprietary nature, not much is known about
it, but some of the ideas present in its design are known
and have been, in a certain way, the basis for the development
of RISC I and II at Berkeley and MIPS at Stanford.

As  opposed  to  the  RISCs  and  the  MIPS,  the  801  is  not  a 
single  chip  processor  but  a  minicomputer. 

The  general  approach  is  the  basis  for  the  design  of  an 
IBM  NMOS  VLSI  single  chip  processor  known  as  ROMP  or  802. 

The 801 machine is basically a 32-bit architecture with
single-cycle four-byte instructions and 32 registers. It has
separate data and instruction cache memories. As in RISC I
and II, the 801 also has a delayed branch scheme, that is,
the branch only takes place after the execution of the next
instruction.
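
The delayed branch scheme just described can be illustrated with a very small interpreter. This is a sketch under assumptions, not a model of the 801 or of RISC I and II: the two-field instruction format, the opcode names "op" and "branch", and the function run are all hypothetical.

```python
def run(program, max_steps=100):
    """Return the order in which instruction addresses are executed."""
    pc, pending_target, trace = 0, None, []
    while 0 <= pc < len(program) and len(trace) < max_steps:
        op, arg = program[pc]
        trace.append(pc)
        next_pc = pc + 1
        if pending_target is not None:
            # The current instruction was the delay slot of an earlier
            # branch, so control transfers only now.
            next_pc, pending_target = pending_target, None
        if op == "branch":
            # The branch takes effect after the next sequential instruction.
            pending_target = arg
        pc = next_pc
    return trace


program = [
    ("op", "setup"),        # address 0
    ("branch", 4),          # address 1
    ("op", "delay slot"),   # address 2: executed although it follows the branch
    ("op", "skipped"),      # address 3: never executed
    ("op", "target"),       # address 4
]
print(run(program))  # [0, 1, 2, 4]
```

The instruction at address 2 has already been prefetched when the branch executes; executing it rather than discarding it is what saves the cycle, provided the compiler can place a useful instruction there.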

The  801  system  is  said  to  be  compiler-based  meaning  that 
a  greater  demand  is  made  on  the  compiler. 

The 801 architecture was defined by George Radin in his
article 'The 801 Minicomputer' [Ref. 2] as the set of run
time operations which:

1) Could not be moved to compile time

2) Could not be more efficiently executed by object code
produced by a compiler which understood the high-level
intent of the program, or

3) Was to be implemented in random logic more effectively
than the equivalent sequence of software
instructions.

Both data and address buses are 32 bits wide. The
addressing modes are few:

- base+index

- base+displacement

- register direct.

Also, a highly effective optimizing compiler was developed
for the system.

D.   THE  MIPS 

The  MIPS  computer  was  developed  at  Stanford  University 
by  John  Hennessy  and  his  students.  Its  acronym  stands  for 
Microprocessor  without  Interlocked  Pipe  Stages. 

There are strong similarities with the RISC project at
Berkeley. It has, however, some conceptual differences that
have already been identified by its proponents in [Ref. 3] as:

i)  more  complex  user  level  instruction  set. 

ii)  the  main  design  goal  is  high  performance  of  the 
hardware  employed  and  not  simplicity  of  the 
instruction  set. 

iii)  much  more  complex  compiler. 

Specifically  its  characteristics  are  the  following: 

1)  32  bit  machine. 

2)  Instruction  Set 
2a)  55  instructions 

2b)  Load/store  architecture 

2c)  All  instructions  except  LOAD  and  STORE  are  single- 
cycle 

2d) Instructions may be 16 or 32 bits long. An optimizing
compiler reorders the instructions so that
all 16-bit instructions always come in pairs.

2e) Addressing Modes

- immediate

- base with offset

- indexed

- base shift


3) Registers

There are sixteen 32-bit general purpose registers.

4) Hardwired control with most of its logic implemented
using PLA's

5) Use of Delayed Branch instructions

6) Five pipeline stages

7) No condition codes

8) Word-addressable machine

9) Separate data and instruction memory

10) No support for floating-point operations

11) Implemented with 4 micron NMOS VLSI technology with
a clock rate of 8 MHz.


E.   TOWARD  A  DEFINITION  OF  A  RISC  MACHINE 

Four machines have been described as examples of a new
type of computer architecture defined as the RISC architecture,
as opposed to the traditional architecture now
referred to as CISC architecture.

Any definition of this architecture will have to encompass
the characteristics common to the four previous
examples.

To  summarize,  a  RISC  Machine  will  have  the  following 
characteristics: 

1)  Simple  instruction  set  where  the  great  majority  of 
the  instructions  are  single-cycle, 

2) Load/store architecture, that is, all instructions are
register-to-register with LOAD and STORE being
the only memory-reference instructions,

3) Very few addressing modes,

4) Hardwired control, i.e., no microcode,

5)  Instructions  with  one  or  two  sizes  and  with  fields  at 
fixed  locations, 

6)  Some  degree  of  pipelining, 

7) Demand on the compiler to increase performance.

III. MY APPROACH TO COMPUTER PERFORMANCE EVALUATION

A.  INTRODUCTION 

This thesis has been motivated by the rise of the new
RISC trend in computer architecture, described in the previous
chapter, and by the claims made by RISC proponents regarding
the inherently superior performance of RISC when compared to
traditional architectures.

Unfortunately, the claims made for these structures were
not supported by any quantitative arguments. No specific
attention was given to the effects of the various factors
introduced in the RISC architecture or to the influence that
each factor had on system performance.

Computer performance evaluation differs depending
on the aspects of performance being evaluated. From the
viewpoint of a potential computer system buyer, there is a
need to identify features in the system which will enhance
the performance for a particular application. From the
viewpoint of a computer architect, performance analysis is a
way to evaluate specific enhancements from which trends in
computer architecture design may follow.

B.  EVALUATION  AND  MEASUREMENTS 

In order to perform an evaluation of any kind, one must
take measurements of the system under different conditions.
The measurements must be taken properly, or else the
evaluation will be invalid.

In order to guarantee that the evaluation will be based
upon correct data, one has to know:

1) What the measurements are for

The buyer is not worried about any of the architectural
details of the machine, but rather about the throughput of a
system programmed in a high-level language.

In contrast, the computer architect must be concerned
with the internal characteristics and the behavior of the
system, even when testing a system using programs
written in high-level languages.

Considering the RISC family of machines, the correct
point of view is undoubtedly the latter one.

2)  What  is  measured 

Typically  one  wants  to  test  how  each  enhancement  to  the 
computer  architecture  affects  the  system  performance.  In 
order  to  get  a  realistic  comparison  of  features,  only  one 
feature  at  a  time  may  differ.  If  more  than  one  feature  is 
different,  it  is  difficult  to  measure  the  individual  effect 
of  each  architectural  feature  on  the  system  performance. 

3)  How  is  the  evaluation  performed 

Because  it  is  not  feasible  to  build  a  new  system  each 
time  one  of  the  architectural  features  is  altered,  a  model 
is  required. 

Because  it  is  through  the  use  of  a  model  that  the 
performance  effects  of  any  architectural  feature  will  be 
determined,  this  model  has  to  be  able  to  quantify,  in  a 
precise  manner,  the  effects  of  any  change  in  the 
architecture. 

4)  For  which  application  are  the  measurements  valid 

The  application  for  which  the  system  is  being  used  has 
an  effect  on  the  system  performance.  No  system  will  show  the 
same  performance  in  two  different  environments.  For  example, 
in  one  application  the  user  might  be  doing  only  word- 
processing,  and,  in  the  second,  the  system  might  be 
floating-point  intensive. 

There are, nevertheless, systems that present a balanced
performance throughout a diversified number of applications.
They are the so-called "General Purpose Computers". But even
for these, the performance fluctuates, indicating that
general purpose computers have a better performance for some
applications than for others.

For these reasons, the system performance evaluation
must pay attention to the rigorous definition of the application
for which the system performance is being evaluated.

This requirement for a precise definition of the application
will clarify the validity of the conclusions.

5)  Which  factors  interact  with  the  measurements 

In the second question, the need to make just one change
at a time when making the evaluation is emphasized; otherwise
it would be impossible to determine the individual
effect of an enhancement on the system performance.

However, if the evaluator has already made measurements
for several changes in the architecture and has also
quantified the effect of each of those changes on the system
performance, it is possible to compare two systems that
differ by all those changes plus an extra one not yet
considered. As a result of the analysis, the effect of this
last change on the system performance can be quantified.
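
How the effect of the extra change is extracted depends on how the individual effects combine, which the text does not specify. The sketch below assumes, purely for illustration, that each change contributes an independent multiplicative speedup factor; under that assumption the unknown factor is the observed overall speedup divided by the product of the known ones. The function name residual_speedup and all numbers are hypothetical.

```python
from math import prod


def residual_speedup(observed_total, known_speedups):
    """Speedup left over for the one change not yet quantified.

    Assumes (this is an assumption, not a result from the text) that the
    individual speedup factors are independent and multiply together.
    """
    if observed_total <= 0 or any(s <= 0 for s in known_speedups):
        raise ValueError("speedup factors must be positive")
    return observed_total / prod(known_speedups)


# Hypothetical figures: two changes already quantified in isolation, and
# the overall speedup measured between the two complete systems.
known = [1.25, 1.10]
observed = 1.65
extra = residual_speedup(observed, known)
print(round(extra, 3))
```

If the changes interact, for example when a larger register file alters how often a particular addressing mode is used, this decomposition no longer holds and the changes have to be evaluated jointly.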

C.   THE  RISC/CISC  CONTROVERSY 

Because the problem being discussed is related to
computer architecture, there is a need for a concise statement
defining Computer Architecture as it is commonly
understood.

The adopted definition is that of IEEE Standard 729-1983,
which defines Computer Architecture as:

"The process of defining a collection of hardware and
software components and their interfaces to establish a
framework for the development of a computer system."

In  the  published  papers  on  RISC,  several  comparisons  of 
CISC  and  RISC  examples  were  made. 

The way these comparisons were done did not give any
insight into the answers to the questions presented in the
previous section, or to other similar questions.

The result is that now no one knows, for example, if the
performance of the RISC II is due primarily to its register
organization scheme, as some claim, or to the simplicity of
its instruction set, as others do.

Specifically,

1) If one wants to evaluate the effects of reducing the
instruction set, one might pick a CISC machine, e.g.,
the VAX-11, and consider the improvements due to all
the instructions whose execution is equivalent to a
sequence of simpler instructions. For each of these
more complex instructions one could determine if the
execution is faster than the equivalent sequence. If
that is not the case, the instruction should be
discarded. If an improvement is seen, then consider
the cost of adding the instruction to the instruction
set.

2)  If  one  wants  to  evaluate  the  effects  of  reducing  the 
number  of  addressing  modes,  one  should  consider: 

- Why are they needed?

- With which data types are they used?

- What is the benefit brought by their addition?

3)  If  one  wants  to  evaluate  the  effects  of  overlapped 
register  windows,  one  should  test  implementation  of 
overlapped  windows  on  several  systems  and  measure,  as 
a  cost/benefit  ratio,  the  effect  of  overlapped 
windows  on  the  system  performance. 

4)  One  cannot  change  more  than  one  feature  at  a  time  and 
hope  to  get  an  idea  of  what  the  effect  of  each 
feature  is  on  the  system  performance. 


5) If one wants to do an evaluation using programs
written in a high-level language, one should state
that as a limiting factor. Since different compilers
generate different code, some compilers are better
than others and therefore make different contributions
to the system performance. Furthermore, in the
case of compiler-generated code, the frequency of
execution of each instruction in the system instruction
set will be different for different high-level
languages. Besides, two different systems with
distinct instruction sets do not necessarily have the
same best compiler.

6) If one wants to make some conclusive statement about
the advantages and disadvantages of the RISC architecture,
one must separate the effects of features
that are orthogonal to the RISC philosophy.
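
The test proposed in point 1 for each complex instruction can be sketched as a comparison of cycle counts. The function name evaluate_composite and the cycle counts are hypothetical and are not measurements of the VAX-11 or of any other machine; execution time is assumed here to be expressed in machine cycles.

```python
def evaluate_composite(composite_cycles, sequence_cycles):
    """Compare one composite instruction with its equivalent simple sequence.

    Returns the cycles saved by the composite instruction and a verdict.
    """
    sequence_total = sum(sequence_cycles)
    cycles_saved = sequence_total - composite_cycles
    # Not faster than the sequence: discard. Faster: the cost of adding the
    # instruction to the instruction set must still be weighed.
    verdict = "weigh cost of adding" if cycles_saved > 0 else "discard"
    return cycles_saved, verdict


# Hypothetical cycle counts, for illustration only.
print(evaluate_composite(6, [2, 1, 2]))  # (-1, 'discard')
print(evaluate_composite(3, [2, 1, 2]))  # (2, 'weigh cost of adding')
```

A fuller treatment would weight the cycles saved by how often the instruction is actually executed in the application, since a large saving on a rarely used instruction contributes little.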

The fact is that in the papers published on RISCs,
almost all the comparisons made involved systems with
different instruction sets, different addressing modes and a
different number of registers and register organization
schemes. Furthermore, compiler-generated code was used
without considering its performance effects. These are the
reasons why no one can say whether the RISC architecture is
or is not better by itself.

In this situation, while the RISC proponents are
bringing some jargon to the architectural environment, those
against RISC are losing track of the possible benefits
present in the RISC philosophy.

D.   AN  EXAMPLE 

As an example, let us pick a common CISC processor, the
MC68000, and consider its addressing modes.

The MC68000 has six basic types of addressing modes,
namely:

1)  REGISTER  DIRECT  -  The  effective  address  is  the 
register  designation  field  in  the  instruction. 

EA  =  Rn 

2)  ABSOLUTE  -  The  effective  address  is  that  given  in  the 
instruction  field  itself  and  it  is  used  directly 
without  modification 

EA  =  INSTRUCTION  FIELD 

3)  REGISTER  INDIRECT  -  The  effective  address  is  the 
contents  of  the  designated  register 

EA  =  (Rn)

4)  IMMEDIATE  -  The  operand  is  part  of  the  instruction 
itself  and  no  further  addressing  is  needed 

5)  PROGRAM  COUNTER  RELATIVE  -  The  effective  address  is 
computed  by  taking  the  value  in  the  program  counter 
register  and  adding  or  subtracting  an  offset  value 

EA  =  PC  +  OFFSET 

or 

EA  =  PC  -  OFFSET 

6)  IMPLIED  -  The  operand  is  in  a  register  designated  by
the  mnemonic  of  the  instruction.
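
The six definitions above can be made concrete with a small operand-resolution routine. This is a simplified sketch, not MC68000 code: the register file is a plain list, memory is a dictionary, the offset is a signed integer, and the function name resolve_operand and the mode strings are hypothetical.

```python
def resolve_operand(mode, field, regs, memory, pc):
    """Return (effective_address, operand) for one addressing mode.

    `field` is the value carried by the instruction: a register number, an
    absolute address, an immediate value or a signed offset, depending on
    the mode. For the implied mode the caller passes the register number
    fixed by the opcode, since no field in the instruction selects it.
    """
    if mode in ("register_direct", "implied"):
        return field, regs[field]          # EA = Rn, operand is in the register
    if mode == "absolute":
        return field, memory[field]        # EA = instruction field
    if mode == "register_indirect":
        ea = regs[field]                   # EA = (Rn)
        return ea, memory[ea]
    if mode == "immediate":
        return None, field                 # operand is part of the instruction
    if mode == "pc_relative":
        ea = pc + field                    # EA = PC + OFFSET (offset may be negative)
        return ea, memory[ea]
    raise ValueError(f"unknown addressing mode: {mode}")


regs = [0] * 8
regs[1] = 0x100
memory = {0x100: 7, 0x200: 9, 0x48: 5}
pc = 0x40

print(resolve_operand("register_direct", 1, regs, memory, pc))    # (1, 256)
print(resolve_operand("register_indirect", 1, regs, memory, pc))  # (256, 7)
print(resolve_operand("pc_relative", 8, regs, memory, pc))        # (72, 5)
```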

The use of each addressing mode depends on the
programmer.

Until now, the philosophy present in the design process
was to give the maximum possible versatility to the
programmer, so that he or she could choose the addressing mode
best suited to his or her needs. The rise of the RISC
architecture raises some questions regarding the correctness
of this philosophy.

In order to answer these questions, there is a need to
have a correct method for the evaluation of system
performance. Together with the evaluation method there are
some points that have to be considered when deciding how
many addressing modes to include in the system instruction
set and how long each addressing mode should be.

The  considerations  are  to: 

1) reduce the storage requirements per program

2) reduce the number of bits that must be moved between
processor and memory to execute a program, i.e.,
reduce the bandwidth requirements on the bus

3) reduce the average length of an instruction, i.e.,
reduce the required width of the instruction bus.

There  is  a  trade-off  between  the  number  of  instructions 
needed  for  the  system  to  execute  a  program  and  the  average 
instruction  size. 
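
The trade-off can be seen with a simple product: the number of instruction bits a program occupies is the instruction count multiplied by the average instruction length. The figures below are hypothetical and only illustrate that fewer instructions do not necessarily mean fewer bits; program_bits is a name introduced here.

```python
def program_bits(instruction_count, average_instruction_bits):
    """Instruction bits occupied by a program (static count, code only)."""
    return instruction_count * average_instruction_bits


# Hypothetical encodings of the same program.
few_modes = program_bits(1000, 32)    # more, shorter instructions
many_modes = program_bits(700, 48)    # fewer, longer instructions

print(few_modes, many_modes)  # 32000 33600
```

In this made-up case the richer set of addressing modes reduces the instruction count by 30 percent and still yields the larger program, because the average instruction has grown by 50 percent.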

The  decision  regarding  the  number  of  addressing  modes  to 
include  is  also  very  much  dependent  on  the  application,  on 
the  data  types,  on  the  operations  involved,  on  the  use  of 
nested  procedures,  and  how  the  parameter  passing  operation 
is  accomplished  between  procedures. 

Although  not  considered  here,  the  addressing  problem  is 
also  very  much  related  to  schemes  of  memory  protection  where 
one  wants  to  forbid  the  regular  user  program  from  accessing 
some  part  of  memory. 

Besides  how  each  one  of  the  addressing  modes  is  used,  it 
is  also  important  to  consider  the  frequency  with  which  each 
addressing  mode  is  used. 

Not  much  material  is  available  regarding  the  usage  of 
addressing  modes.  As  an  example,  consider  again  the 
addressing  modes  of  the  MC68000. 

1)  REGISTER  DIRECT 

Since the operand is, in this case, in a register, no
memory accesses are involved. This provides some speed
advantages when used for operating on frequently-accessed
variables. For infrequently-accessed variables it would not
be used because the number of registers available on-chip is
usually very small.

2)  ABSOLUTE 

A  memory  access  cycle  is  involved  in  absolute 
addressing,  because  the  operand  is  in  memory.  For  this 
reason  it  is  not  as  fast  as  the  previous  mode. 

Absolute  addressing  does  not  have  much  versatility 
because  the  instruction  address  field  is  constant  and  the 
operand  must  reference  a  fixed  location  in  memory. 
Nevertheless,  it  is  simple.  Because  no  alteration  on  the 
address  field  of  the  instruction  is  performed,  absolute 
addressing  is  an  efficient  mode  to  use  when  the  operand  is 
within  the  range  of  the  instruction. 

3)  REGISTER  INDIRECT 

In  the  register  indirect  mode,  one  register  access  plus 
one  memory  access  cycle  are  involved  because  the  register 
holds  the  operand  address  and  not  the  operand  itself. 

The  register  indirect  approach  is  used  when  the  address 
of  the  operand  has  just  been  calculated.  It  provides 
address-range  extension,  and  in  fact  this  extension 
increases  with  the  difference  between  the  size  of  the 
instruction  address  field  and  the  size  of  the  specified 
register. 

4)  IMMEDIATE 

Immediate addressing is the fastest way of addressing,
although it is limited by the instruction size. No additional
memory accesses are needed since the operand is
within the instruction itself. Since programs are not
self-modifying, it is used only for predefined constant values.

5)  PROGRAM  COUNTER  RELATIVE 

The major advantage of relative addressing is that it
allows the generation of position-independent code because
the location referenced is always fixed relative to the
program counter. The importance of this fact is very much
dependent on the memory management scheme adopted in the
system.

In addition to the regular memory access, an addition or
subtraction must also be executed. It is used in relative
jump instructions, e.g., to set up loops or to set up parameters
to be passed to a subroutine.

6)  IMPLIED

Implied  addressing  is  equivalent  to  the  register  direct 
addressing.  However,  implied  addressing  restricts  the 
opcode  to  the  predetermined  register  specified  by  the  design 
of  the  opcode  and  the  design  of  the  processor. 
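
Combining the memory access cost of each mode with the frequency of its use gives an expected number of operand memory accesses per operand reference. In the sketch below the access counts follow the discussion above (the instruction fetch itself is not counted), while the usage frequencies are invented for illustration, since, as noted, little material on actual usage is available. The names MEMORY_ACCESSES and expected_operand_accesses are hypothetical.

```python
# Operand memory accesses per mode, following the discussion above.
MEMORY_ACCESSES = {
    "register_direct": 0,     # operand is in a register
    "absolute": 1,            # operand is in memory
    "register_indirect": 1,   # one register access plus one memory access
    "immediate": 0,           # operand is inside the instruction
    "pc_relative": 1,         # memory access plus an addition or subtraction
    "implied": 0,             # equivalent to register direct
}


def expected_operand_accesses(usage):
    """Average operand memory accesses, weighted by mode usage frequency."""
    if abs(sum(usage.values()) - 1.0) > 1e-9:
        raise ValueError("usage frequencies must sum to 1")
    return sum(freq * MEMORY_ACCESSES[mode] for mode, freq in usage.items())


# Hypothetical usage frequencies, for illustration only.
usage = {
    "register_direct": 0.40,
    "absolute": 0.10,
    "register_indirect": 0.25,
    "immediate": 0.15,
    "pc_relative": 0.05,
    "implied": 0.05,
}
print(round(expected_operand_accesses(usage), 2))
```

Removing an addressing mode changes both the table and the frequencies, because its uses must be re-expressed with the remaining modes; this is one reason the modes cannot be judged by their individual costs alone.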

E.   SUGGESTED  APPROACH 

It is not feasible to build a new system each time a
single architectural feature is changed in order to evaluate
its effects on system performance.

As a result, there is a need for a model.
This model should be clear, complete, and able to
reflect the interrelations that exist between the different
components. The model should also be applicable to any
computer system, i.e., the model should be general.

The model should reflect the performance effects of any
computer architectural feature such as:

- Bus Width
- Addressing Modes
- Pipelining
- Instruction Queue
- Instruction Prefetching

In the method suggested for computer performance evaluation,
a comparison is made between a reference system and
the same system with some change. The reference system is
the computer system for which it is desired to determine the
impact of each architectural enhancement. The result of
this comparison will then constitute a measure of the
performance effects of the particular change. The conceptual
view of the system used in the model is illustrated in
Figure 3.1.


Figure 3.1    Conceptual View.

Four  entities  are  considered: 

1)  The Application; any evaluation will only be valid
for a certain application, not for any application

2)  The  System  being  considered 

3)  The  System  Instruction  Set 

4)  The   Performance,   as   the  object   of  the   evaluation 
process. 

The  instruction  set  constitutes  the  central  point  of  the 
conceptual  view.  The  application  uses  it.  The  system 
supports  it.  The  best  match  will  necessarily  give  the  best 
performance. 

The application is characterized by a set of tasks that
must be performed. Each task is performed with a different
frequency. For each task a program must be written, so that
one task is mapped into one program. Each one of these
programs executes in a different time.

The  weight  of  each  task  or  its  representation  in  the 
application  is  then  the  product  of  the  frequency  of  its 
execution  and  the  corresponding  program  execution  time. 

The application affects the system performance through
the frequency of execution of each instruction in the
system instruction set. This, together with the average
execution time of the programs of interest, will ultimately
lead to a "typical" program of the application.

The  system  supports  an  instruction  set  in  two  ways:  one 
by  the  execution  time  of  each  instruction  and  the  other  by 
the  complexity  of  the  control  unit  necessary  to  implement 
the  instruction  set. 

An  instruction  set  is  desired  that  allows  for  the 
writing  of  programs  with  a  minimum  execution  time,  but  also 
minimizes  the  amount  of  support  that  has  to  be  given  by  the 
system.

IV.  TIMING ANALYSIS

A.   INTRODUCTION 

In  this  chapter  a  detailed  analysis  of  the  model  for 
computer  performance  evaluation  is  introduced.  As  described 
in  the  previous  chapter  the  model  is  divided  into  two  parts. 
In the first part, the model considers a timing analysis. In
this analysis the application determines the dynamic
frequency of execution of each instruction present in the
system instruction set, and the system architectural
characteristics determine the execution time of each
instruction.

In  the  second  part  of  the  model,  which  follows  in  the 
next  chapter,  the  model  considers  the  relation  between  the 
application  and  the  control  unit  necessary  to  implement  the 
system  instruction  set.  From  this  relation  a  performance 
figure  is  obtained. 

Any architectural feature will have consequences both in
the execution time of each instruction and in the complexity
of the control.

As has already been mentioned, the first part of the
model is a timing measure. It will consider the execution
time of the specified application's "typical" program.

Several factors contribute to the execution time of a
program and not all of them are part of the computer
architecture. Some depend on the implementation of the
system.

The implementation is very much related to the
technology chosen. The technology will determine, for example,
the maximum clock rate obtainable and the number of computer
components to be placed on chip.

Two factors have a great impact on the system
performance: the clock rate and the average memory access
time. The number of components on chip is also an important
factor, since one of the most time consuming operations is
to transmit data from one place to another. For example, by
being able to have more registers on chip, one might be able
to reduce the average operand access time and therefore
speed up the computer operation. If one considers the
storage registers as part of the system memory, then one can
see that the average memory access time is reduced.

In the suggested approach to computer performance
evaluation, the main concern is architectural features and not
implementation restrictions due to technology limitations.
The reason for this is that a method to evaluate computer
performance should be general and therefore be able to
survive constant technological change.

B.   THE  COMPUTER  SYSTEM 

Any  computer  system  architecture  is  made  of  hardware  and 
software  tools.  In  the  area  of  software,  an  important  factor 
is  the  operating  system. 

For the sake of simplicity, and since in fact the
operating system is also a program that has to be run on the
system, it can be considered as part of the application in
the computer performance evaluation process.

If the operating system were not considered as part of the
application software, there would be a need to track all
calls to the operating system, measure the time the system
takes to execute the corresponding subroutines and subtract
this from the program execution time.

In  the  hardware,  the  major  components  are: 
i)  the  processor 
ii)  the  memory 
iii)  the  busses 
iv)  the  I/O  interfaces 
v)  glue circuits

The processor consists of the control unit, the
arithmetic logic unit, the general purpose registers and
the busses that connect all of these.

The  memory  consists  of  all  the  parts  of  a  computer  used 
for  either  temporary  or  permanent  storage,  for  instructions 
or  for  data.  The  busses  are  a  collection  of  signal  lines 
with  multiple  sources  and  multiple  sinks.  They  provide  for 
the  intercommunication  capability  among  the  other  computer 
components.  The  I/O  interfaces  are  the  parts  of  the 
computer  through  which  the  system  communicates  with  the 
outside  world. 

In order for the overall system to have a good
performance, it is desired to balance the average work done by
each component per unit of time. Since each computer component
has a different function, the work done by each is different
from the others. It is this work that has to be
characterized, so that it is possible to understand how to
maximize it.

One requirement is that the idle time for each component
should be as low as possible. For example, the processor
should spend as little time as possible in an idle state
waiting for a data element stored in memory.

1.   Memory  and  I/O  Interface 

Both  memory  and  I/O  interface  can  be  considered 
together,  since  both  are  communication  media.  Memory 
performs  a  communication  between  two  instants  in  time.  I/O 
interfaces  perform  a  communication  between  the  computer 
system  and  the  outside  world. 

For both memory and I/O the work is characterized by
how long it takes to correctly receive a unit of information
from the bus and how long it takes to correctly place the
same unit of information on the bus. This unit of
information is the same for instructions and for data: it
is one bit.

For both memory and I/O, the measure of their
performance is the number of bits that are received or
transmitted per unit of time. This is in fact no more than a
bandwidth in units of bits per second.

For  example,  a  memory  unit  with  a  word  size  of 
sixteen  bits  and  an  access  time  of  two  microseconds  performs 
the  same  work  as  another  memory  with  a  word  size  of  thirty 
two  bits  and  an  access  time  of  four  microseconds. 
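
The equivalence in the example above can be checked with a short sketch. The function name and the use of seconds as the time unit are assumptions made for illustration only; the text defines just the measure itself (bits per unit of time).

```python
import math

def bandwidth_bits_per_second(word_size_bits, access_time_s):
    """Work of a memory or I/O unit: bits moved per second."""
    return word_size_bits / access_time_s

# The two memories from the example in the text.
narrow = bandwidth_bits_per_second(16, 2e-6)  # 16-bit word, 2 microseconds
wide = bandwidth_bits_per_second(32, 4e-6)    # 32-bit word, 4 microseconds

print(narrow, wide)  # both are about 8 million bits per second
```

Under this measure the wider word buys nothing by itself if the access time grows in the same proportion.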

2.   The  Busses 

The  function  of  a  bus  is  to  pass  information  from  a 
computer  component  acting  as  a  source  to  other  components 
acting  as  sinks.  The  memory  and  I/O  interfaces  are  also 
communication  media  that  treat  data  and  instructions  in  the 
same  way. 

The nature of these signals has no influence on the
characterization of the bus work or the efficiency with
which the bus performs its work.

The  bus  work  is  characterized  by: 

i)  the   number  of   active   sources   at  a   time,    here 
assumed  to  be  one 

ii)  the  number  of  active  sinks 

iii)  the number of signal lines, i.e., the bus width

iv)  the  bus  cycle  time 

As its function is to be a communication medium, the
bus work is measured by a bandwidth in units of bits per
second.

The particular bus bandwidth will be given by:

$$ BW = \frac{SI \cdot WI}{BCT} \qquad (4.2) $$

where

SI - is the number of active sinks
WI - is the bus width
BCT - is the bus cycle time
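
As a minimal sketch, the bus bandwidth can be computed from the three quantities just defined, assuming the bandwidth is the number of active sinks times the bus width divided by the bus cycle time. The function name and the sample values (a 16-bit bus with a 500 ns cycle) are hypothetical.

```python
import math

def bus_bandwidth(active_sinks, bus_width_bits, bus_cycle_time_s):
    """Bits delivered per second: SI * WI / BCT, with one active source."""
    return active_sinks * bus_width_bits / bus_cycle_time_s

# Hypothetical 16-bit bus with a 500 ns cycle time.
single = bus_bandwidth(1, 16, 500e-9)     # one sink listening
broadcast = bus_bandwidth(2, 16, 500e-9)  # two sinks receive the same word

print(single, broadcast)
```

Counting the sinks means that a broadcast to two components is credited with twice the work of a transfer to one.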

3.   The Processor

After  receiving  data  and/or  instructions  from  the 
bus,  the  processor  alters  this  data  according  to  the 
sequence  of  instructions  and  then  delivers  the  final  results 
back  to  the  bus. 

While the previous computer components treat data
and instructions in the same manner, this is not true for
the processor. In this case, instructions specify the
operations that have to be performed, and the data
constitutes the object on which the operations are performed.

The structure of the processor, i.e., the specific
configuration of each element, is dependent on the
instruction set and on the data types involved. The instruction
set configuration makes requirements on the processor,
because the instruction set is intimately related to the
processor control unit and the datapath.

The data types involved in an application should be
supported by the processor. If, for example, a lot of array
manipulation is done, then it is to be expected that the
system provides some parallel operation capability.

In  addition  to  the  data  types,  the  instruction  set 
is  also  dependent  on  the  application.  Therefore  the 
processor  structure  is  also  dependent  on  the  application. 

C.   THE  APPLICATION 

An application is characterized in the same way
independent of the computer system being evaluated. It is
characterized by a certain number of tasks that have to be done.
Each task is executed with a certain frequency. For each
task and for each system there will correspond a program
written with that system instruction set.

The frequency of execution of each task is given by the
number of times (n) that this task is executed in a sample
of N tasks. So the frequency of execution of each task is
nothing more than the probability of this task being in
execution at any given time.

$$ f_i = \frac{n_i}{N} \qquad (4.3) $$

where

f_i - is the frequency of execution of task i
n_i - number of times the task i was executed in a
big sample
N - total number of tasks that were executed in
that sample

For each task there is a corresponding computer program.
This program will take some time to execute.

The weight of each task or its representation in the
application will be given by the product of its execution
frequency and its program execution time in the system under
study.

$$ W_i = f_i \cdot T_i \qquad (4.4) $$

where

W_i - weight of the task i in the particular
application and for the system in study
T_i - execution time of the corresponding program

By this it is seen that the weight of the task is
dependent both on the application choice and on the system
choice.
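
The two definitions above can be illustrated with a small sketch: the frequency of a task is its share of a sample of executed tasks, and its weight is that frequency times the execution time of its program. The task names, counts and execution times below are invented for illustration.

```python
import math

def task_frequencies(counts):
    """f_i = n_i / N, where N is the total number of tasks in the sample."""
    total = sum(counts.values())
    return {task: n / total for task, n in counts.items()}

def task_weights(frequencies, exec_times):
    """W_i = f_i * T_i for every task i."""
    return {task: frequencies[task] * exec_times[task] for task in frequencies}

# Hypothetical sample of 100 executed tasks and program times in seconds.
counts = {"sort": 60, "search": 30, "report": 10}
exec_times = {"sort": 2.0e-3, "search": 0.5e-3, "report": 8.0e-3}

freqs = task_frequencies(counts)
weights = task_weights(freqs, exec_times)
print(freqs)
print(weights)
```

Note that the rarely executed "report" task still carries a large weight because its program is slow, which is why both factors enter the product.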

A program is a sequence of instructions. Its execution
time can be divided into smaller pieces where only one
instruction is executed. In this way the program execution
time is given by a sum of products. Each element of the sum
refers to a single instruction, and consists of
the product of the instruction execution time and the number
of times that instruction is executed.

Therefore each element of the sum will be given by:

$$ S_j = N_j \cdot IXT_j \qquad (4.5) $$

where

N_j - is the number of times that the instruction j
is executed for the particular program
IXT_j - execution time of instruction j

The program execution time will be given by:

$$ T_i = \sum_{j=1}^{J} S_j \qquad (4.6) $$

where

S_j - the weight of instruction j in the system
instruction set and for the particular task
J - the total number of instructions in the
system instruction set

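Finally, the weight of the application for the system
under study will be given by the weighted sum of its tasks.
So,

$$ W_a = \sum_{i=1}^{I} W_i \qquad (4.7) $$

but since

$$ W_i = f_i \cdot T_i \qquad (4.4) $$

then

$$ W_a = \sum_{i=1}^{I} f_i \cdot T_i \qquad (4.8) $$

But

$$ T_i = \sum_{j=1}^{J} S_j \qquad (4.6) $$

and

$$ S_j = N_j \cdot IXT_j \qquad (4.5) $$

So, writing N_ij for the value of N_j in the program of
task i,

$$ W_a = \sum_{i=1}^{I} f_i \sum_{j=1}^{J} N_{ij} \cdot IXT_j $$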


D.   THE  PERFORMANCE 

A comparison is made between the weights that an
application has in two different systems. In this chapter, where
a timing analysis is done, the weight of an application
involves the execution time of each instruction and the
dynamic frequency of execution of the same instructions.

The performance will be given by the ratio of these two
weights.

$$ P = \frac{W_r}{W_s} $$

where

W_r - is the weight of the particular application
for the reference system
W_s - is the weight of the same application for the
system being considered

Note that the two systems either have two different
instruction sets or the time of execution of each
instruction is different or both.
So, with the execution counts N and the execution times IXT
taken for each task's program on the respective system,

$$ P = \frac{\sum_{i=1}^{I} f_i \sum_{j=1}^{J} N_{ij} \cdot IXT_j}{\sum_{i=1}^{I} f_i \sum_{k=1}^{K} N_{ik} \cdot IXT_k} $$

where

I - is the total number of tasks in the particular
application. It is the same as the number of
programs.
J - is the total number of instructions in the
reference system instruction set
K - is the total number of instructions in the
system in study instruction set

Considered in this way the measure of the performance
for a system is better the larger the ratio.
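
The comparison can be sketched as follows, assuming the application weight is the frequency-weighted sum of program execution times and the reference system's weight goes in the numerator (so that a larger ratio is better). All instruction names, counts and execution times (in clock cycles) are hypothetical.

```python
def program_time(instr_counts, ixt):
    """T_i: sum over instructions of (times executed) * (execution time)."""
    return sum(n * ixt[instr] for instr, n in instr_counts.items())

def application_weight(task_freq, task_programs, ixt):
    """W: sum over tasks of f_i * T_i on one system."""
    return sum(f * program_time(task_programs[task], ixt)
               for task, f in task_freq.items())

task_freq = {"a": 0.75, "b": 0.25}
programs = {
    "a": {"load": 100, "add": 50},
    "b": {"mul": 20, "load": 10},
}
ixt_ref = {"load": 4, "add": 2, "mul": 10}  # reference system, cycles
ixt_new = {"load": 2, "add": 2, "mul": 10}  # same system with faster loads

w_ref = application_weight(task_freq, programs, ixt_ref)
w_new = application_weight(task_freq, programs, ixt_new)
performance = w_ref / w_new
print(w_ref, w_new, performance)
```

Here both systems share one instruction set and only the load time changes; with two different instruction sets, each system would also need its own `programs` table.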

E.   A  SPECIAL  CASE  AND  THE  RISC 

If the application involves only one task and therefore
only one program, the performance would be given by:

$$ P = \frac{\sum_{j=1}^{J} N_j \cdot IXT_j}{\sum_{k=1}^{K} N_k \cdot IXT_k} $$

Let us now consider the RISC philosophy. For this case
the value of J is fixed.

The RISC proponents advocate that by reducing the total
number of instructions in the instruction set, i.e., by
reducing the value of K, the performance of the system
increases. They also advocate that the instruction execution
time for each instruction is reduced by having a simpler,
more straightforward machine with better performance.

Their  argument  is  that  the  value  of  the  denominator  is 
reduced  because  the  two  previous  factors  compensate  for  the 
necessary  increase  in  the  number  of  times  each  instruction 
is  executed.  By  reducing  the  denominator  the  system  will 
have  a  better  performance. 
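
The RISC argument can be made concrete with a single-program sketch. The numbers are invented: a reference machine with a memory-to-register add is compared with a simpler machine that must use a separate load and add, but executes every instruction in fewer cycles.

```python
def program_time(instr_counts, ixt):
    """Sum over instructions of (times executed) * (execution time)."""
    return sum(n * ixt[instr] for instr, n in instr_counts.items())

# Reference system: 3 instruction types, slower instructions (cycles).
ref_counts = {"load": 100, "add_mem": 100, "store": 50}
ref_ixt = {"load": 4, "add_mem": 6, "store": 4}

# Reduced system: each add_mem becomes a load followed by an add,
# so more instructions are executed, but each takes 2 cycles.
risc_counts = {"load": 200, "add": 100, "store": 50}
risc_ixt = {"load": 2, "add": 2, "store": 2}

ref_time = program_time(ref_counts, ref_ixt)
risc_time = program_time(risc_counts, risc_ixt)
performance = ref_time / risc_time
print(sum(ref_counts.values()), sum(risc_counts.values()))
print(ref_time, risc_time, performance)
```

With these assumed values the denominator shrinks despite the larger instruction count; with less favorable execution times the ratio could just as well fall below one, which is exactly what the model is meant to expose.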

F.   THE SYSTEM ARCHITECTURE AND TIMING

As has just been seen, the particular choice of
application determines the dynamic frequency of execution of each
instruction in the instruction set. To continue the study,
there is now a need to analyze how the system architectural
characteristics influence the system performance.

The system structure and its instruction set are
necessarily related. For every instruction, the system has to
have the necessary support in terms of the control unit and
the datapath. Also, any new enhancement to the system
architecture will affect the execution time of one or more
instructions. Therefore it will always affect the average
instruction execution time.

The model under discussion considers that each
instruction has a certain associated weight, this weight being
dependent on the application and on the system architecture.
The application determines the number of times each
instruction is executed, i.e., the dynamic frequency of execution
of the instruction. The system architecture determines the
execution time of each instruction. It is this execution
time that will now be studied.

We  define  the  Life  Cycle  of  an  instruction  (LC)  as  the 
time  period  beginning  at  the  instant  the  instruction  is 
first  fetched  from  memory  and  ending  at  the  instant  the 
final  results  produced  by  the  operation  are  stored  back  in 
memory. 

The  instruction  execution  time  will  then  be  some  portion 
of  its  time  life  cycle.  This  portion  will  be  dependent  on 
the  system  architectural  characteristics  such  as  pipelining, 
parallel  processing,  instruction  prefetching,  instruction 
queue,  etc. 

The  main  phases  through  which  an  instruction  has  to  pass 
in  its  life  cycle  are: 
i)  Fetching
ii)  Execution 

The time the system takes to fetch an instruction is
dependent on the instruction bus width, the instruction
length and the bus cycle time in the following way:

$$ TF = \frac{\text{INSTRUCTION LENGTH}}{\text{BUS WIDTH}} \cdot BCT $$

This value for the fetch time will be an average, more
or less rigorous, depending on:

i)  instruction size (fixed or variable)

ii)  the availability of the instruction queue
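
A minimal sketch of the fetch time follows, assuming it is the number of bus transfers needed (instruction length over bus width) times the bus cycle time; this proportional form is an assumption, as are the function names and the sample values. For variable-length instructions the average is taken over a hypothetical distribution of lengths.

```python
def fetch_time(instruction_length_bits, bus_width_bits, bus_cycle_time):
    """Assumed form: (length / width) bus cycles, each lasting bus_cycle_time."""
    return (instruction_length_bits / bus_width_bits) * bus_cycle_time

def average_fetch_time(length_frequencies, bus_width_bits, bus_cycle_time):
    """length_frequencies maps an instruction length in bits to the
    fraction of executed instructions that have that length."""
    return sum(freq * fetch_time(length, bus_width_bits, bus_cycle_time)
               for length, freq in length_frequencies.items())

# 16-bit instruction bus with a 100 ns bus cycle (times in nanoseconds).
fixed = fetch_time(32, 16, 100)
variable = average_fetch_time({16: 0.5, 32: 0.25, 48: 0.25}, 16, 100)
print(fixed, variable)
```

An instruction queue would hide part of this time behind execution, which is one reason the text calls the value an average that is only more or less rigorous.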

Not all the instructions have the same structure, but
nevertheless, all of the instructions accomplish some
transformation on some data. The data might be one or more
operands and the final result in the case of an arithmetic
instruction, or the data might be the contents of the
program counter in the case of a branch.

In  order  for  the  system  to  be  able  to  accomplish  the 
transformation  required  by  the  instruction,  it  has  to: 

1)  decode  the  instruction 

2)  locate the data (e.g., addressing modes)

3)  place  the   data  in   a  convenient   location  to   be 
transformed,  if  it  is  not  there  already 

4)  perform   the   transformation   asked   for  by   the 
instruction 

5)  relocate the data in a convenient location.

Whether these phases are performed in a sequential
fashion or in parallel depends on the system architecture.
For example, suppose that the instructions followed a fixed
format with separate and predefined fields for OPCODE and
ADDRESSING. Then it would be possible to decode the
instruction and the address field simultaneously.

In order for the system to process the addressing mode,
and depending on the particular addressing mode, it may have
to do one or more of the following:

-  perform data transfers, either register-to-register
or memory-to-register;

-  perform some addition, e.g., in the case of base
addressing, index addressing or branch addressing;

-  perform some multiplication, e.g., in the case of
the VAX-11 index mode.

For the sake of simplicity one could consider all the
data transfers that have to be done while the system
executes a program and determine an average time for data
transfer.

Typically if the system has on-chip registers, cache
memory and main memory, the value for the average data
transfer time will be:

$$ \bar{t} = \frac{R}{T}(RAT) + \frac{C}{T}(CAT) + \frac{M}{T}(MAT) \qquad (4.13) $$

where 

R  -  number  of  register  accesses 

C  -  number  of  cache  accesses 

M  -  number  of  main  memory  accesses 

T  -  total  number  of  data  transfers 

RAT  -  register  access  time 

CAT  -  cache  access  time 

MAT  -  memory  access  time 

and

T = R + C + M        (4.14)
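
The average data transfer time is a weighted mean of the three access times, which a short sketch makes explicit. The access counts and access times (in clock cycles) are hypothetical; the second call illustrates the effect of moving accesses from memory into additional registers.

```python
import math

def average_transfer_time(r, c, m, rat, cat, mat):
    """(R/T)*RAT + (C/T)*CAT + (M/T)*MAT, with T = R + C + M."""
    t = r + c + m
    return (r / t) * rat + (c / t) * cat + (m / t) * mat

# Access times in cycles: register 1, cache 2, main memory 10.
baseline = average_transfer_time(600, 300, 100, 1, 2, 10)
more_registers = average_transfer_time(800, 150, 50, 1, 2, 10)
print(baseline, more_registers)
```

The same total number of transfers costs less on average once a larger share of them is served by registers.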

In  summary,  in  the  instruction  life  cycle  one  has: 

TF  -  fetching  time 

TDEC  -  decoding  time 

TLOC  -  locating  data  (  address  mode  ) 

TDATA  -  access  data 

TOP  -  perform  the  operation 

TW - write the final results

If the system performs all of these time phases in a
sequential fashion so that there is no overlap, then the
instruction time life cycle will just be the summation of
all the time phases:

LCno = TF + TDEC + TLOC + TDATA + TOP + TW    (no overlap)    (4.15)

If  some  overlap  among  the  phases  is  present,  then  the 
instruction  time  life  cycle  will  be  some  portion  of  the 
previous  value  (no  overlap  case). 

LCo = y * LCno      (overlap case)        (4.16)

where

y - is a coefficient that measures the efficiency
of the architectural scheme that accounts for
the overlap possibility. Its value will
always be between zero and one.

Some of the architectural characteristics that might
influence the value of "y" are:

-  separate or common memories for data and
instructions,

-  instruction  format 

-  instruction  type 

-  bus  width 

-  dual  port  memories 

The architectural characteristics will also determine
the amount of overlap execution among different
instructions. The efficiency of this overlap will then determine
what portion of the instruction time life cycle value will
be the instruction execution time (IXT).

IXT = w * LCo        (4.17)

where

IXT - instruction execution time
w - efficiency of the overlap among the time life
cycles of different instructions. Values
ranging from zero to one.
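
The chain from phase times to instruction execution time can be sketched as below. The phase durations and the values chosen for y and w are purely illustrative; in practice they would have to be estimated for a given architecture.

```python
def life_cycle_no_overlap(tf, tdec, tloc, tdata, top, tw):
    """LCno: all six phases executed strictly one after another."""
    return tf + tdec + tloc + tdata + top + tw

def scale(coefficient, value):
    """Apply an efficiency coefficient (y or w), which must lie in [0, 1]."""
    if not 0.0 <= coefficient <= 1.0:
        raise ValueError("coefficient must be between zero and one")
    return coefficient * value

# Hypothetical phase times in clock cycles.
lc_no = life_cycle_no_overlap(tf=4, tdec=1, tloc=2, tdata=3, top=2, tw=2)
lc_o = scale(0.75, lc_no)  # y = 0.75: overlap among phases of one instruction
ixt = scale(0.5, lc_o)     # w = 0.5: overlap among different instructions
print(lc_no, lc_o, ixt)
```

Keeping y and w separate makes it possible to attribute a gain either to overlap inside one instruction or to overlap between instructions, such as pipelining.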

The value of w, that is, the amount of overlap, will be
determined by several architectural characteristics such as:

-  pipelining 

-  prefetching 

-  instruction  queue 

-  parallel  processing 

-  instruction  length 

-  bus  width 

-  datapath

V.  CONTROL ANALYSIS

A.   INTRODUCTION 

In the previous chapter a timing analysis of the system
operation  was  presented.  In  it  a  study  was  made  first  of  the 
application  effects  on  performance  through  the  dynamic 
frequency  of  execution  of  each  instruction,  and  second  of 
the  system  architecture  effects  on  performance  through  the 
execution  time  of  each  instruction. 

Finally  to  complete  the  model  being  suggested,  one  has 
to  consider  the  requirements  that  the  instruction  set  poses 
on  the  system  in  terms  of  the  required  control  complexity. 

These  requirements  will  also  be  dependent  on  the 
application. 

This  is  also  important  since  no  matter  what  technology 
is  used  in  the  system  implementation,  the  number  of 
resources  available  on-chip  will  always  be  limited. 

Typically the control unit is either implemented using
microcode or is hardwired, e.g., using programmable logic
arrays. Some of the factors that impact the choice are:

•  instruction  set  complexity 

•  required  control  unit  size 

•  possibility  of  future  changes  in  the  instruction  set 

•  speed 

The size of the control unit (i.e., the number of gates
needed to implement the control unit) will determine the
space available on-chip for other components. In the case of
the RISC I and II, the smaller control unit, and therefore the
smaller power consumption, allowed the designers to add more
registers to the processor chip. With the choice of
additional hardware for the processor, the designers in fact
reduce the average memory access time if one considers the
registers as also part of the system memory.

B.   THE CONTROL UNIT AS A FINITE STATE MACHINE

The  control  unit  of  a  computer  system  can  be  viewed  as  a 
finite  state  machine,  and  therefore  can  be  analyzed  as  such. 
If  analyzed  in  that  way,  the  control  unit  operation  can  be 
described by a state diagram. In its most simple and most
general case, the state diagram will typically have only two
states, see Figure 5.1.

Figure 5.1    Simple Control Unit State Diagram.

In a more detailed analysis, the control unit state
diagram will have a tree-like format where any vertical path
will correspond to the execution of an instruction, see
Figure 5.2.

In this case, each and every instruction is identified,
and each state, although still belonging to one of the two
major phases, fetch and execute, will now correspond to a
microstep in the control unit output sequence while the
system is executing a program.

Figure 5.2    More Detailed Control Unit State Diagram.

Of  course  this  is  complicated  if  the  system  is  able  to 
deal  with  more  than  one  instruction  at  a  time.  Nevertheless 
the  complexity  of  the  controller  can  always  be  associated 
with  the  number  of  states. 

C.   THE  CONTROL  UNIT  COMPLEXITY 

Not all the states will count in the same fashion since
there are states that will be common to more than one
instruction or vertical path.

The number of these shared states will depend both on
the processor instruction set itself and on the
implementation choices made by the processor designer. For example,
the processor designer could make use of microcode
subroutines to be shared or called by more than one
instruction.

If states are shared among instructions, then there will
always be some trade-off between the total number of states
of the control unit and its speed. This trade-off is due to
the fact that when states are shared among different
instructions, the control unit has to have some feedback
capability. The specific value of the feedback will select
the next state of the control unit at the point where the
vertical paths corresponding to the instructions ultimately
separate.

No  matter  what  this  feedback  will  be,  it  will  always 
have  some  cost  related  to  it.  The  cost  is  the  extra  time  it 
takes  for  the  values  of  the  feedback  signals  to  be  valid. 
Since the cost is time, it will be reflected in the average
instruction execution time, and so affect the performance of
the system in the portion of the model described in the
previous chapter.

In  this  part  of  the  model  we  focus  on  the  comparisons  of 
two  control  units. 

The  complexity  of  a  particular  instruction  will  then  be 
dependent  both  on  the  number  of  states  it  has  and  on  the 
number  of  states  which  are  shared  by  more  than  one 
instruction. 

The  cost  of  adding  a  new  instruction  to  a  certain 
processor  instruction  set  is  the  number  of  new  states  that 
have  to  be  added  to  the  control  unit  state  diagram.  The 
addition  of  this  instruction  will  have  a  cost  on  the  system 
performance  that  can  be  minimized  by  maximizing  the  number 
of  states  necessary  to  its  execution  that  are  already  in 
existence in the control unit state diagram.

Returning to the control unit, the number of states is
then dependent on:

i)  the number of instructions

ii)  the number of states that are common to more
than one vertical path (or instruction)

iii)  the average height of each instruction

where the height of one instruction is defined as the
number of states in its vertical path.
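
These quantities can be sketched by modeling each instruction as the list of states on its vertical path, with shared states appearing under the same name in several paths. The instruction names and state names below are hypothetical and do not describe any particular processor.

```python
def total_states(paths):
    """Distinct states in the control unit; shared states count once."""
    return len(set().union(*paths.values()))

def average_height(paths):
    """Mean number of states on an instruction's vertical path."""
    return sum(len(path) for path in paths.values()) / len(paths)

def cost_of_adding(paths, new_path):
    """New states the control unit needs for one more instruction."""
    existing = set().union(*paths.values())
    return len(set(new_path) - existing)

paths = {
    "ADD":  ["fetch", "decode", "read_regs", "alu", "write_reg"],
    "LOAD": ["fetch", "decode", "calc_addr", "mem_read", "write_reg"],
    "JMP":  ["fetch", "decode", "calc_addr", "write_pc"],
}
store_path = ["fetch", "decode", "calc_addr", "read_regs", "mem_write"]

print(total_states(paths))               # distinct states
print(average_height(paths))             # average instruction height
print(cost_of_adding(paths, store_path)) # states STORE would add
```

In this toy diagram the three paths hold 14 states in total but only 8 distinct ones, and the hypothetical STORE instruction is cheap because all but one of its states already exist.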

D.   THE APPLICATION AND THE CONTROL UNIT

In the previous chapter the instruction set and the
dynamic frequency of execution of each instruction together
with the instruction execution time were considered. Now
one wants to know how effective the control unit is for the
application where the processor is being used.

It  has  already  been  seen  that  the  complexity  of  the 
control  unit  is  related  to  the  number  of  states.  One  knows 
that a smaller and simpler control unit has an effect on the
processor  performance,  because  more  space  would  be  available 
on-chip  for  other  resources.  One  choice  might  be  to  add  new 
registers  to  the  processor  chip  and  thus  try  to  decrease  the 
average  memory  access  time. 

One  also  wants  to  minimize  the  number  of  instructions 
that  are  needed  in  order  to  perform  a  certain  task,  so  one 
has to go back to the application. An application is
characterized by a certain number of tasks that have to be done.
Each  task  is  performed  with  a  certain  frequency.  For  each 
task  a  program  will  have  to  be  written  using  the  instruction 
set  available.  Each  program  corresponds  to  a  sequence  of 
instructions  used  to  perform  the  corresponding  task. 

Directly from the program it should be possible to compute the
static frequency of each instruction. But that is not the only
frequency that is of interest to the performance evaluation
process. The dynamic frequency of execution is more important.

The two frequencies will be different for each instruction,
depending on:

i)  program  sequence 

ii)  conditional branches and the most frequent values of the
variables in the conditions.

The execution of a program is then a sequence of instruction
executions.

Since  a  single  instruction  corresponds  to  a  vertical 
path  in  the  processor  control  unit  state  diagram,  the  execu- 
tion of  a  program  will  then  be  an  up  and  down  walk  on  the 
state  diagram. 

When comparing two control units, the one that has to execute
fewer instructions will be the better one, provided that the
average height of an instruction is the same for both control
units. The height of an instruction is in fact a measure of what
the RISC proponents call the instruction complexity. Because two
different processors will naturally have instruction sets with
different values for the average height of an instruction, the
bottom line is that the complexity of two control units cannot
be compared by counting the instructions executed. It must be
compared by counting the number of states through which each
control unit has to pass when the system executes a typical
application program.
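
The point of the preceding paragraph can be illustrated with a
small sketch. The two instruction sets, their heights, and the
dynamic counts below are hypothetical values chosen only for
illustration; they do not come from any real processor.

```python
def states_traversed(heights, dynamic_counts):
    """Total states visited: sum over instructions of N_j * H_j."""
    return sum(dynamic_counts[name] * heights[name] for name in dynamic_counts)


# Hypothetical machine A: few instructions, tall vertical paths.
heights_a = {"MULF": 12, "MOVE": 4}
counts_a = {"MULF": 10, "MOVE": 40}

# Hypothetical machine B: more instructions, short vertical paths.
heights_b = {"ADD": 3, "SHIFT": 2, "LOAD": 3}
counts_b = {"ADD": 40, "SHIFT": 40, "LOAD": 20}

instructions_a = sum(counts_a.values())
instructions_b = sum(counts_b.values())
total_a = states_traversed(heights_a, counts_a)
total_b = states_traversed(heights_b, counts_b)

print("A:", instructions_a, "instructions,", total_a, "states")
print("B:", instructions_b, "instructions,", total_b, "states")
```

In this made-up case machine A executes half as many
instructions as machine B, yet its control unit passes through
more states, which is why the state count and not the
instruction count is taken as the basis for comparison.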

It is to be expected that adding an instruction to a processor
instruction set will force the control unit to expand. For a
hardwired implementation, e.g., one using PLAs, the PLAs will
have to grow; for a microcode implementation there will
typically be a need to increase the size of the microcode
memory. The amount of the control unit expansion will depend on
the implementation, on the instruction itself, and on the
designer's choice regarding the number of states that will be
shared with existing instructions. There is a relation between
the number of gates used to implement a controller and the
number of states present in the controller state diagram.

Because there is a direct and individual relation between the
control unit states and the gates that compose the control unit,
and because one wishes to use every one of these gates a similar
number of times in order to increase the overall efficiency, it
is desirable that all states be used in a balanced way.
Similarly, one might say that the efficiency of the use of an
instruction set increases when all the instructions in that
instruction set tend to be used an equal number of times.

An  application  has  an  indirect  relation  to  the  number  of 
states  through  which  the  control  unit  has  to  pass  in  order 
for  the  system  to  execute  the  corresponding  programs. 

In  the  optimum  case  the  control  unit  will  have  the 
following  characteristics: 

i)  minimum  number  of  gates 

ii)  for the specific application all states will be used a
balanced number of times

iii)  no  state  exists  that  will  never  be  used. 

E.   THE  MODEL 

Assume  that  a  control  unit  has  a  total  number  of  states 
T.  Associated  with  each  state  there  will  be  a  certain 
number  of  gates.  This  number  will  be  dependent  on  the  imple- 
mentation choice,  either  microcode  or  hardwired  logic.  Of 
these  T  states,  an  application  uses  S  states,  and  of  these  S 
states  some  states  will  be  used  more  than  others. 

The  weight  of  the  application  is  related  to  the  number 
of  states  through  which  the  control  unit  has  to  pass  in 
order  to  execute  the  corresponding  programs. 

Each  state  has  some  weight  associated  with  it.  This 
weight  will  be  dependent  on: 

i)  the  number  of  times  the  state  is  used 

ii)  the number of instructions that share the state

iii)  the number of gates needed for implementing each state.

The  complexity  of  an  instruction  will  be  related  to  its 
height,  that  is  the  number  of  states  in  the  corresponding 
vertical  path  in  the  control  unit  state  diagram. 

So,

$$ C_j = \sum_{h=1}^{H} W_h $$

where

C_j - complexity of the instruction j

W_h - weight of state h

H   - height of the instruction j

and

$$ W_h = \frac{G}{U_h} $$

where

G   - number of gates per state (implementation)

U_h - number of instructions to which the state is common

The weight of an instruction will be the product of the number
of times the instruction is executed for a given program and the
instruction complexity. That is,

$$ W_j = N_j \, C_j $$

where

N_j - number of times the instruction j is executed

As in the previous chapter, the weights of the task and the
application will be:

$$ W_i = \sum_{j=1}^{J} W_j $$

where

W_i - weight of task i

F_i - frequency of task i for a certain application

J   - number of instructions in the instruction set

For an application its weight will be:

$$ W_a = \sum_{i=1}^{I} F_i \sum_{j=1}^{J} W_j $$

or

$$ W_a = \sum_{i=1}^{I} F_i \sum_{j=1}^{J} N_j \sum_{h=1}^{H} \frac{G}{U_h} $$

where

W_a - weight of the application

I   - number of tasks in the application of interest

J   - number of instructions in the processor instruction set

H   - height of each instruction

Similar to the timing analysis in the previous chapter, the
performance of the system under study will be given by:

$$ Perf = \frac{W_{ar}}{W_a} $$

where

W_ar - weight of the application for the reference system

W_a  - weight of the same application for the system being
considered

So,

$$ Perf = \frac{\sum_{i=1}^{I} F_i \sum_{j=1}^{J} N_j \sum_{h=1}^{H} \frac{G_0}{U_h}}{\sum_{i=1}^{I} F_i \sum_{k=1}^{K} N_k \sum_{l=1}^{L} \frac{G_1}{U_l}} $$

where

I   - number of tasks (programs) in the application

J   - number of instructions in the reference system instruction
set

K   - number of instructions in the system under study
instruction set

H   - height of instruction j in the reference system
instruction set

L   - height of instruction k in the system under study
instruction set

N_j - number of times instruction j is executed while the
reference system executes the typical application program

N_k - number of times the instruction k is executed while the
system under study executes the same program

G_0 - number of gates per state in the reference system control
unit

G_1 - number of gates per state in the system under study
control unit

U_h - number of instructions that share state h in the reference
system control unit state diagram

U_l - number of instructions that share state l in the system
under study control unit state diagram.
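
A minimal sketch of how the model might be evaluated is given
below. It assumes that each instruction is described by the list
of states in its vertical path, so that the sharing count U_h
can be obtained by counting the instructions whose path contains
state h. The function names, the two tiny instruction sets, and
the gate count of 10 per state are hypothetical.

```python
from collections import Counter


def state_sharing(instruction_paths):
    """U_h: number of instructions whose vertical path contains state h."""
    sharing = Counter()
    for path in instruction_paths.values():
        for state in set(path):
            sharing[state] += 1
    return sharing


def instruction_complexity(path, sharing, gates_per_state):
    """C_j: sum over the states h of instruction j of G / U_h."""
    return sum(gates_per_state / sharing[state] for state in path)


def application_weight(instruction_paths, gates_per_state, tasks):
    """W_a: sum over tasks of F_i times the sum over instructions of N_j * C_j.

    tasks is a list of (frequency, {instruction name: execution count}).
    """
    sharing = state_sharing(instruction_paths)
    complexity = {
        name: instruction_complexity(path, sharing, gates_per_state)
        for name, path in instruction_paths.items()
    }
    return sum(
        freq * sum(n * complexity[name] for name, n in counts.items())
        for freq, counts in tasks
    )


# Hypothetical reference system: three states per instruction.
reference = {"ADD": ["fetch", "decode", "add"], "SUB": ["fetch", "decode", "sub"]}
# Hypothetical system under study: two states per instruction.
studied = {"ADD": ["fetch", "add"], "SUB": ["fetch", "sub"]}
tasks = [(1.0, {"ADD": 3, "SUB": 1})]  # one task, frequency 1

w_ref = application_weight(reference, 10, tasks)
w_sys = application_weight(studied, 10, tasks)
perf = w_ref / w_sys  # Perf = W_ar / W_a
print(w_ref, w_sys, perf)
```

A value of Perf greater than one means that, for this
application, the system under study passes through a smaller
weighted number of states than the reference system.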

VI.  CASE ANALYSIS

A.  INTRODUCTION 

As  an  example  we  will  analyze  the  change  in  performance 
of  a  particular  application  program  when  some  floating  point 
capability  is  added  to  a  processor  which  currently  performs 
fixed  point  arithmetic. 

In  this  case  study,  the  performance  effects  of  the 
program  code  sequence  will  not  be  considered.  These  effects 
are  mostly  due  to  any  capability  of  the  processor  related 
to: 

•  pipelining 

•  parallel  processing 

Specifically, the case consists of the possible addition of a
floating point multiply instruction to a processor instruction
set. The processor that was chosen was the Motorola MC68000.
The application for this evaluation is the computation of a
Fast Fourier Transform.

B.  THE  ADDITION  OF  AN  INSTRUCTION 

The  addition  of  an  instruction  to  the  original  instruc- 
tion set  has  several  consequences. 

First  of  all  if  a  hardwired  controller  is  used  the 
processor's  control  unit  must  be  expanded  so  that  the 
instruction  is  incorporated.  The  amount  of  the  control  unit 
expansion  is  dependent  on  the  number  of  new  states  that  the 
instruction  under  consideration  will  add  to  the  control  unit 
state  diagram  and  also  on  the  control  unit  implementation. 

In fact, one of the reasons to use microcode in the
implementation of an instruction set is the flexibility it
gives for any future changes of the instruction set.

Second, and depending on the operation performed by the
instruction, some hardware will have to be added to the
processor. The amount of hardware that will have to be added
depends both on the hardware already existing on-chip that the
instruction might use and on how fast one wants the instruction
to operate.

The  addition  of  more  hardware  to  the  processor  will 
cause  a  rise  in  the  power  consumed  by  the  processor.  Due  to 
a  limited  power  dissipation  capability,  the  net  effect  of 
the  increase  in  the  number  of  gates  that  constitute  the 
control  unit  and  the  datapath  will  be  a  reduction  in  the 
size  of  existing  processor  components  or  a  migration  of  some 
off-chip,  so  that  the  power  consumed  by  the  processor  stays 
constant. 

One  choice  might  be  to  replace  some  of  the  registers 
available  on-chip  by  the  hardware  necessary  for  the  new 
instruction.  By  reducing  the  number  of  registers  on-chip, 
there  will  be  a  decrease  in  the  ratio  of  register  accesses 
to  the  number  of  main  memory  accesses. 

In  the  case  of  a  Load/Store  architecture  such  as  the 
RISC  architecture,  a  reduction  in  the  number  of  registers 
will  cause  an  increase  in  the  dynamic  frequency  of  execution 
of  LOAD  and  STORE  instructions  relative  to  the  other 
instructions. 

In  a  traditional  architecture,  where  the  LOAD  and  STORE 
instructions  are  not  the  only  memory  reference  instructions, 
the  effect  of  reducing  the  number  of  on-chip  registers  is  an 
increase  in  the  average  instruction  execution  time  because 
the  proportion  of  memory  accesses  to  register  accesses  will 
increase. 

This increase in average instruction execution time will cause
an increase in the execution time of the typical application
program. It is this increase in execution time that will have
to be overcome by the addition of the new instruction to the
processor instruction set, so that the program execution time
is in fact reduced rather than increased.

C.   THE  COST/GAIN  TRADEOFF 

The  floating  point  multiply  instruction  after  being 
added  to  the  processor  instruction  set,  will  replace  the 
sequence  of  instructions  that  the  processor  had  to  execute 
every  time  a  multiplication  of  two  floating  point  numbers 
was  called  for. 

In order for the addition of the floating point multiply
instruction to be considered, the instruction has to pass
several tests. The first test requires the instruction
execution time to be smaller than the execution time of the
corresponding instruction sequence.

If  that  is  not  the  case,  then  there  is  no  point  in 
adding  the  instruction  to  the  processor  instruction  set. 

So,  consider: 

Ini  -  execution  time  of  the  new  instruction 

Iseq  -  execution  time  of  the  corresponding  sequence  of 
instructions 

For  the  addition  of  the  new  instruction  to  be  consid- 
ered: 

Ini  <  Iseq  (6.1) 

Assume then that the above condition is in fact true; then

Iseq  =  Ini  +  Igain      (6.2)

or

Ini  /  Iseq  =  c      (6.3)

where  c  <  1

For  the  sake  of  simplicity,  consider  that  the  applica- 
tion of  interest  is  composed  of  only  one  task.  That  is  to 
say  that  the  effects  on  the  processor  performance  will  be 
considered  only  within  the  context  of  a  program. 

The model suggested for computer performance evaluation has two
parts, a timing analysis and a control unit complexity
analysis. These two parts of the model give rise to two
distinct criteria with which the addition of the instruction
will have to comply, so that the gain in processor performance
that is obtained surpasses the reduction, or cost, in processor
performance due to the requirements that the same instruction
imposes on the processor architecture.

1.   Timing  Criterion 

The timing model says that the effect on the system performance
of the addition of one instruction to the system instruction
set will be measured by:

$$ Perf = \frac{\sum_{j=1}^{J} N_j L_j}{\sum_{j=1}^{J} {Na}_j \, {La}_j + N_{new} L_{new}} \tag{6.4} $$

where

J    - number of instructions in the original system
instruction set

N_j  - number of times that the instruction j is executed
before the addition of the new instruction to the processor
instruction set

L_j  - execution time of the same instruction j on the original
system

Na_j - number of times that the instruction j is executed after
the addition of the new instruction

La_j - execution time of the instruction j after the addition
of the new instruction

Nnew - number of times the new instruction is executed

Lnew - execution time of the new instruction

The  numerator  is  a  measure  of  the  execution  time  of 
the  application  program  before  the  addition  of  the  instruc- 
tion under  consideration.  The  denominator  is  a  measure  of 
the  execution  time  of  the  application  program  after  the 
addition  of  the  new  instruction. 

The  sequence  of  instructions  in  the  original 
instruction  set  that  implements  the  operation  performed  by 
the  new  instruction  is  executed  a  number  of  times.  This 
number  will  be  equal  to  Nnew. 

The sequence execution time will consist of the execution time
of several instructions. Therefore

$$ L_{seq} = \sum_{j=1}^{J} {Ns}_j L_j \tag{6.5} $$

where

Ns_j - number of times that the instruction j of the original
instruction set is executed during the execution of the
sequence of instructions

then

$$ N_j = {No}_j + N_{new} \, {Ns}_j \tag{6.6} $$

and

$$ \sum_{j=1}^{J} N_j L_j = \sum_{j=1}^{J} {No}_j L_j + N_{new} \sum_{j=1}^{J} {Ns}_j L_j \tag{6.7} $$

where

No_j - number of times the instruction j of the original
instruction set is executed outside the sequence.

For improvement in performance:

Perf  >  1      (6.8)

This indicates that it is worthwhile to add the new instruction
to the original instruction set for this application.

Then, one wants

$$ \sum_{j=1}^{J} {No}_j L_j + N_{new} \sum_{j=1}^{J} {Ns}_j L_j > \sum_{j=1}^{J} {Na}_j \, {La}_j + N_{new} L_{new} \tag{6.9} $$

but

$$ L_{seq} = \sum_{j=1}^{J} {Ns}_j L_j $$

so

$$ \sum_{j=1}^{J} {No}_j L_j + N_{new} L_{seq} > \sum_{j=1}^{J} {Na}_j \, {La}_j + N_{new} L_{new} \tag{6.10} $$


The right term of the inequality corresponds to the increase in
the application program execution time that was caused by the
suppression of some hardware components of the processor, e.g.,
some registers.

This increase is caused either by an increase in the number of
instructions that have to be performed (the case of the LOAD
and STORE instructions in a Load/Store architecture) or by an
increase in the average instruction execution time (the case of
a traditional architecture).

Therefore

$$ N_{new} (L_{seq} - L_{new}) > \sum_{j=1}^{J} {Na}_j \, {La}_j - \sum_{j=1}^{J} {No}_j L_j = T_{cost} \tag{6.11} $$

On the left term of equation 6.11,

Lseq  -  Lnew

represents the gain in execution time that was obtained by
substituting the new instruction for the sequence of original
instructions each time the operation was performed.

So,

Lseq  -  Lnew  =  Timing  Gains  =  Tgain      (6.12)

Then,

Nnew  Tgain  >  Tcost      (6.13)

or

Nnew  >  Tcost  /  Tgain      (6.14)



Based on a timing analysis, it is only advantageous to add the
new instruction if:

1)  Lseq  >  Lnew      (6.15)

and

2)  Nnew  >  Tcost  /  Tgain      (6.16)

To put it another way, the addition of an instruction to a
processor instruction set will only increase performance if
that instruction is executed a sufficient number of times
during the execution of the application programs. The exact
number of times the instruction must be executed is given by
the above criterion.
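
The timing criterion can be sketched in a few lines. The
instruction names, execution times, and counts below are
hypothetical and serve only to show how the performance ratio
and the break-even value of Nnew would be obtained; the function
names are likewise assumptions of this sketch.

```python
def program_time(counts, times):
    """Execution time of a program: sum over instructions of N_j * L_j."""
    return sum(counts[name] * times[name] for name in counts)


def timing_perf(counts_before, counts_after, times, n_new, l_new):
    """Time before the addition divided by time after the addition."""
    before = program_time(counts_before, times)
    after = program_time(counts_after, times) + n_new * l_new
    return before / after


def break_even_count(t_cost, t_gain):
    """Smallest integer Nnew that satisfies Nnew * Tgain > Tcost."""
    if t_gain <= 0:
        raise ValueError("the new instruction must be faster than the sequence")
    return int(t_cost // t_gain) + 1


times = {"ADD": 4, "SHIFT": 6, "MOVE": 8}              # cycles (hypothetical)
counts_before = {"ADD": 100, "SHIFT": 100, "MOVE": 50}
sequence = {"ADD": 4, "SHIFT": 4}                      # replaced by the new instruction
n_new, l_new = 10, 10
# After the addition: the sequence is gone, 10 extra MOVEs model lost registers.
counts_after = {"ADD": 60, "SHIFT": 60, "MOVE": 60}
counts_outside = {"ADD": 60, "SHIFT": 60, "MOVE": 50}

t_seq = program_time(sequence, times)
t_gain = t_seq - l_new
t_cost = program_time(counts_after, times) - program_time(counts_outside, times)
perf = timing_perf(counts_before, counts_after, times, n_new, l_new)
print(t_gain, t_cost, break_even_count(t_cost, t_gain), round(perf, 3))
```

With these made-up figures the new instruction must be executed
at least 3 times to pay for the extra memory traffic; it is
executed 10 times, so the performance ratio exceeds one.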
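
2.   Control Unit Complexity Criterion

Concerning the analysis of the control unit complexity one has:

$$ Perf = \frac{\sum_{j=1}^{J} N_j \sum_{h=1}^{H} \frac{G_0}{U_h}}{\sum_{j=1}^{J} {Na}_j \sum_{h=1}^{H} \frac{G_0}{{Ua}_h} + N_{new} \sum_{h=1}^{H_{new}} \frac{G_0}{{Ua}_h}} $$

where Ua_h is the number of instructions that share state h
after the addition and Hnew is the height of the new
instruction.

Since the implementation of the control unit will be the same
and the implementation determines the value of G0, the equation
simplifies to

$$ Perf = \frac{\sum_{j=1}^{J} N_j \sum_{h=1}^{H} \frac{1}{U_h}}{\sum_{j=1}^{J} {Na}_j \sum_{h=1}^{H} \frac{1}{{Ua}_h} + N_{new} \sum_{h=1}^{H_{new}} \frac{1}{{Ua}_h}} $$

As in the timing analysis one wants:

Perf  >  1

That is

$$ \sum_{j=1}^{J} N_j \sum_{h=1}^{H} \frac{1}{U_h} > \sum_{j=1}^{J} {Na}_j \sum_{h=1}^{H} \frac{1}{{Ua}_h} + N_{new} \sum_{h=1}^{H_{new}} \frac{1}{{Ua}_h} $$

As before, the execution of the sequence will consist of the
execution of several instructions. Following the same steps as
in the timing analysis, let

Ls - represent the gain in the number of states obtained each
time the operation performed by the instruction and/or the
sequence is executed

Es - represent the cost in the number of states due to the
addition of the new instruction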
Then 


Nnew  *  Ls  >  Es      (6.25)


or


Nnew  >  Es  /  Ls      (6.26)


D.   AN  ILLUSTRATIVE  EXAMPLE 

An example is now presented to clarify the use of the model
suggested in the present and previous chapters.

The example quantifies the effects of adding a floating point
multiply instruction to an existing processor instruction set.

As has been previously stated, the values determined for the
increase or decrease in the system performance will only be
valid for a given application.

1.   The  Processor 

The Motorola MC68000 is selected for this example. The MC68000
is a widely known microprocessor that has a simple instruction
set offering no floating point support.

The MC68000 has a 16-bit data bus and a 24-bit external address
bus. In addition to the Program Counter and Status Register,
the MC68000 has seventeen 32-bit registers. These registers are
divided into two groups. The first group, composed of eight
registers, consists of general purpose data registers. The
second group, composed of the remaining nine registers, is used
mostly for handling addresses.

In  total,  there  are  fourteen  addressing  modes  on  the 
MC68000,   although  they   can  be  studied  in   six  basic  types. 
These  addressing  modes  are  already  described  in  chapter  two 
of  this  thesis. 

The  instruction  set  of  the  MC68000  consists  of  56 
basic  instructions,  having  from  zero  to  two  addresses.  Each 
instruction  can  use  several  addressing  modes.  This  fact 
determines  that  the  MC68000  does  not  follow  a  Load/Store 
architecture. 

The  instruction  set  of  the  MC68000  supports  five 
basic  types  of  data: 

bits

bytes (8 bits)

words (16 bits)

longwords (32 bits)

packed binary-coded decimal (BCD) with two digits per byte

The input/output on the MC68000 is memory-mapped, i.e., all I/O
interfaces share the address space with memory.

Considering the implementation of the MC68000, it is a
single-chip VLSI HMOS processor with a typical clock rate
between 4 and 12 MHz and with a typical memory access time of 4
clock cycles.

2.   The  Application 

For the application we choose a program that computes a Fast
Fourier Transform. This program was obtained from 'The Fast
Fourier Transform' by E. Oran Brigham [Ref. 4]. The program is
written in Fortran. The flowchart of the computation done by
this program is on page 161 of the above reference. The program
itself appears on page 164 of the same book.

From the reading of the program, one can immediately verify
that some of the operations that are called for could not be
directly implemented with the MC68000 instruction set.

For these operations it was necessary to use either subroutines
present in 'Microprocessor Systems, a 16-Bit Approach' by
William J. Eccles [Ref. 5] or newly written subroutines. The
subroutines to handle floating point numbers in the MC68000
came from Ref. 5.

The subroutines that were written are shown in Appendixes C and
D. These subroutines compute the sine and the cosine of an
angle, according to an algorithm presented in the 'Software
Manual for the Elementary Functions' by William J. Cody, Jr.
and William Waite [Ref. 6: pp. 125-143].

The translated program for the Fast Fourier Transform
computation is shown in Appendixes A and B.

3.   The Floating Point Representation

The floating point representation that was chosen is the IEEE
proposed standard for single precision. This standard
determines a 32-bit long representation of a floating point
number, shown in Figure 6.1.

Figure 6.1    Floating Point Representation.

This standard has the following characteristics:

i)  32 bits are used

ii)  radix of two

iii)  the radix point before the first digit, with an assumed
one to the left

iv)  mantissa

iv.a)  sign position - 0

iv.b)  value position - 9-31

iv.c)  representation - normalized, sign/magnitude

v)  exponent

v.a)  sign position - no sign

v.b)  value position - 1-8

v.c)  representation - biased exponent, bias = 127 (dec)

v.d)  range of exponent - -126 to 127

vi)  range of floating point number - +-5.9*10**-39 to
+-1.7*10**38

All the subroutines that were used to handle the floating point
data obey this standard, and so does the hardware necessary to
implement the floating point multiply.
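
The field layout listed above can be checked with a short
sketch that splits a 32-bit single precision word into its sign,
biased exponent, and fraction fields. It relies on Python's
standard struct module, which encodes the modern IEEE 754
single precision format; that this matches the proposed
standard used here field for field is an assumption. The bit
positions in the text are counted from the most significant end,
so position 0 is the sign bit. The function name decode_single
is hypothetical.

```python
import struct


def decode_single(value):
    """Return (sign, biased exponent, fraction) of a single precision number."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31                    # position 0 in the text's numbering
    biased_exponent = (bits >> 23) & 0xFF  # positions 1-8
    fraction = bits & 0x7FFFFF           # positions 9-31
    return sign, biased_exponent, fraction


for x in (1.0, -2.5, 0.15625):
    sign, exponent, fraction = decode_single(x)
    print(x, sign, exponent, exponent - 127, hex(fraction))
```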

4.   The Hardware Involved

The general structure of the hardware required for the
implementation of an additional floating point multiply
instruction in the MC68000 instruction set was obtained from
the 'Introduction to Computer Architecture' [Ref. 7: p. 80] and
is shown in Figure 6.2.

The hardware consists of:

i)  three 32-bit registers, which can be some of the already
existing data registers on the MC68000,

ii)  an 8-bit adder used for the exponent addition, which could
just be the adder already existing on the MC68000,

iii)  a multiplier used for the mantissa multiplication,

iv)  an exclusive-or gate for the product sign calculation,

v)  a normalizer and converter

With the hardware structure that was chosen it is possible to
perform in parallel the determination of the sign of the
result, the addition of the two exponents, and the
multiplication of the two mantissas.

Figure 6.2    General Hardware Structure for the
Floating Point Multiply Instruction.

The  execution  time  of  the  floating  point  multiplica- 
tion instruction  will  then  be  determined  by  the  slowest  of 
these  three  distinct  and  parallel  operations. 

The sign computation involves just one exclusive-or gate and
therefore takes a maximum of one clock cycle.

The exponent computation involves in fact the addition of the
two exponents followed by the subtraction of the bias, which is
performed concurrently with the determination of exponent
overflow or underflow.

From [Ref. 7], the addition of the contents of two registers
using the MC68000 takes 4 clock cycles to complete. After this
addition an extra clock cycle will be taken for the
determination of exponent overflow and underflow together with
the subtraction of the extra bias. Therefore it is concluded
that the addition of the two exponents will take a maximum of 5
clock cycles.

For the multiplication of the mantissas, a multiplier will have
to be added to the processor hardware. According to 'Digital
Systems: Hardware Organization and Design' by Frederick J. Hill
and Gerald R. Peterson [Ref. 8], the multiplier structure that
gives the best cost/performance tradeoff in terms of the
hardware involved and the time it takes to perform a
multiplication is a multiplier that uses a carry-save adder.
Therefore a carry-save adder type multiplier was chosen.

Also, according to [Ref. 8: p. 361], the time that a carry-save
adder multiplier takes to perform an N-bit multiplication,
using an adder for which each addition/shift cycle takes two
clock cycles, is given by:

Tmult  =  (N+1) Tc      (6.27)

where 

Tc  -  is  the  clock  cycle  time 

In the case being discussed the multiplication involves two
operands, the mantissas. Each mantissa is 24 bits long.
Therefore, according to the formula shown above, the
multiplication of the two mantissas will take 25 clock cycles.
This makes the multiplication the longest operation involved.

Note  that,  the  detection  of  a  zero  product  can  be 
done  concurrently  with  the  multiplication,  since  a  zero 
product  will  happen  only  in  the  case  where  one  of  the  oper- 
ands is  zero. 

The normalization must still be done sequentially. The
normalization involves at most one left shift of the mantissa
product and a decrement of the product exponent. There is at
most one shift, since the mantissas of both operands are in
normalized form and therefore their values are between 0.5 and
1. In the worst case, the two mantissas are both 0.1 (binary)
and so their product will be 0.01 (binary). In this case only
one left shift is necessary in order to normalize the mantissa
of the product.
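
The bound of one left shift can be illustrated with a small
sketch. It assumes that each normalized mantissa is held as a
24-bit integer m with its leading bit set, standing for the
fraction m / 2**24 in the range 0.5 to 1, and that the product
is truncated rather than rounded. Both choices, and the
function name, are assumptions of this sketch and are not taken
from the hardware description.

```python
MANTISSA_BITS = 24


def multiply_mantissas(m1, m2):
    """Multiply two normalized mantissas; return (mantissa, left shifts)."""
    low = 1 << (MANTISSA_BITS - 1)       # 0.1 (binary), i.e. 0.5
    high = 1 << MANTISSA_BITS            # 1.0, exclusive upper bound
    if not (low <= m1 < high and low <= m2 < high):
        raise ValueError("mantissas must be normalized 24-bit values")
    product = m1 * m2                    # 48-bit product, fraction in [0.25, 1)
    shifts = 0
    if product < (1 << (2 * MANTISSA_BITS - 1)):
        product <<= 1                    # leading bit was 0: one shift restores it
        shifts = 1                       # the product exponent is decremented once
    return product >> MANTISSA_BITS, shifts   # keep the upper 24 bits


print(multiply_mantissas(0x800000, 0x800000))  # worst case: 0.1 x 0.1 (binary)
print(multiply_mantissas(0xFFFFFF, 0xFFFFFF))  # largest mantissas: no shift
```

Because the smallest possible product of two normalized
mantissas is 0.01 (binary), the test in the sketch never needs
to be repeated, which is the property the hardware relies on.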

The normalization requirement that the standard makes on the
mantissa also dictates that there is no possible recovery from
an overflow or underflow of the product exponent.

In conclusion, the floating point multiply instruction with
this hardware will take approximately 26 clocks to complete.
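
The cycle counts quoted in this subsection can be collected in
a short sketch. Only the three concurrent operations are
modeled; the sequential normalization step comes on top of the
result and is not given a cycle count of its own here. The
function names are hypothetical.

```python
def mantissa_multiply_cycles(n_bits):
    """Tmult = (N + 1) Tc, expressed in clock cycles."""
    return n_bits + 1


def parallel_phase_cycles(sign_cycles, exponent_cycles, mantissa_cycles):
    """The concurrent operations finish when the slowest one does."""
    return max(sign_cycles, exponent_cycles, mantissa_cycles)


sign = 1                                  # one exclusive-or gate
exponent = 4 + 1                          # register add, then bias and range check
mantissa = mantissa_multiply_cycles(24)   # 24-bit mantissas
parallel = parallel_phase_cycles(sign, exponent, mantissa)
print(sign, exponent, mantissa, parallel)
```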

The hardware that would have to be added to the MC68000 would
only consist of the 24-bit carry-save adder, the exclusive-or
gate, and some logic to determine overflow or underflow of the
exponent and a zero product.

All this hardware will be more or less equivalent to two of the
32-bit registers existing on the MC68000. Say then that, due to
power dissipation limitations on the MC68000, two of the 32-bit
data registers would be removed from the MC68000 in order to
add the hardware necessary to implement the floating point
multiply instruction.

5.   The  Model 

As stated previously, the addition of the instruction will have
some costs. One of these costs has been mentioned in the
previous subsection: the removal of two of the data registers.

As one might expect, the removal of some of the registers from
the MC68000 will in general have an effect on the system
performance by reducing the number of register accesses and
increasing the number of main memory accesses.

In the specific case of the application that is being
considered, this is not true because, at most, six of the eight
data registers are used at one time. Therefore, for this
specific case, the timing costs involved due to the addition of
the floating point multiply instruction will be zero.

For each and every subroutine involved in this application, the
execution time of the subroutine was computed following a worst
case and a best case criterion. The difference between the two
execution time values for each subroutine arises from the data
dependency of the number of times each instruction is executed.

The  execution  times  of  each  subroutine  were  then 
combined,  best  with  best  and  worst  with  worst,  in  order  to 
define  two  boundary  lines  for  the  final  execution  time  of 
the  whole  program. 

For  the  specific  case  of  the  floating  point  multiply 
subroutine,  the  smallest  execution  time  corresponds  to  a 
multiplication  of  two  floating  point  numbers  where  one  of 
them  is  zero.  The  longest  execution  time  for  the  same 
subroutine  corresponds  to  the  multiplication  of  two  numbers 
where  an  exponent  underflow  occurred  after  the  normalization 
step.  Here,  for  the  same  reason  as  before,  the  normaliza- 
tion requires  at  most  one  left  shift. 

Specifically,  the  values  obtained  for  the  execution 
times  of  each  subroutine  are  shown  in  Table  I  in  terms  of 
clock  cycles. 

For  the   whole  program   the  execution   time  will   be 
dependent  on   the  values  of   the  data   and  on  the   number  of 
entry  points  (N)   to  the  Fast  Fourier  Transform  computation. 
The  values  obtained   in  terms  of  clock  cycles   and  number  of 
required  floating  point  multiplies  are  shown  in  Table  II. 

TABLE I

EXECUTION TIME OF EACH SUBROUTINE
IN FAST FOURIER TRANSFORM PROGRAM
(clock cycles)

SUBROUTINE    BEST CASE         WORST CASE

GETFP         162               162
STEP          180               253
NORM          126               1524
ADDFP         178               1929
MULTFP        203               604
SINE          2681+3MULTFP      14459+9MULTFP
COSINE        3904+3MULTFP      20756+9MULTFP

TABLE II

FAST FOURIER TRANSFORM
APPLICATION PROGRAM EXECUTION TIME
(clock cycles)

N       BEST CASE                 WORST CASE

16      572482+352MULTFP          1899074+736MULTFP
32      1418194+880MULTFP         4734674+1840MULTFP
64      3484658+2112MULTFP        11444210+4416MULTFP
128     8198594+4928MULTFP        26770882+10304MULTFP
256     18901458+11264MULTFP      61352402+23552MULTFP
512     42902562+25344MULTFP      138417186+52992MULTFP
1024    96186226+56320MULTFP      308440946+117760MULTFP
2048    213497794+123904MULTFP    680458178+259072MULTFP
4096    469394450+270336MULTFP    1488217106+565248MULTFP

The best case and the worst case executions of the floating
point multiply subroutine take 203 and 604 clock cycles,
respectively (Table I). For a clock rate of 10 MHz, the
program execution time before the addition of the new
instruction is shown in Table III.

TABLE III

FFT PROGRAM EXECUTION TIME BEFORE THE ADDITION
OF THE FLOATING POINT MULTIPLY INSTRUCTION

N       BEST EXECUTION TIME    WORST EXECUTION TIME
        (SEC)                  (SEC)

16      0.064                  0.234
32      0.160                  0.584
64      0.391                  1.411
128     0.920                  3.299
256     2.119                  7.558
512     4.805                  17.042
1024    10.762                 37.957
2048    23.865                 83.694
4096    52.427                 182.963
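
The conversion from the clock-cycle expressions of Table II to seconds can be sketched in a few lines of Python. This is only an illustrative sketch: the function and argument names are assumptions introduced here, while the cycle counts are those of Tables I and II for N = 16 and the clock rate is the 10 MHz quoted in the text.

```python
def execution_time_seconds(base_cycles, n_multfp, multfp_cycles,
                           clock_hz=10_000_000):
    """Program time = (fixed cycles + multiplies * cycles per multiply) / clock rate."""
    total_cycles = base_cycles + n_multfp * multfp_cycles
    return total_cycles / clock_hz


# N = 16 row of Table II, using the MULTFP subroutine costs of Table I.
best_16 = execution_time_seconds(572482, 352, 203)     # best case
worst_16 = execution_time_seconds(1899074, 736, 604)   # worst case
print(f"N=16 best: {best_16:.3f} s, worst: {worst_16:.3f} s")
```

The same function gives the times after the enhancement when `multfp_cycles` is set to the execution time of the new instruction instead of the subroutine.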

For the same clock rate, the program execution time
after the addition of the floating point multiply
instruction is shown in Table IV.

The best case is the one where the implementation of
the floating point multiply offers less gain.

For the best case

Tgain = 203 - 26 = 177 clock cycles

For the worst case

Tgain = 604 - 26 = 578 clock cycles

As already explained, Tcost is zero in both cases.
This is because, in this particular application program,
two of the general purpose data registers are never used.
If all the general purpose data registers were used in the
application program, this would not be true. In that case
there would be an increase in the ratio of the number of
register accesses to the number of main memory accesses,
causing an increase in the average operand access time and
an increase in the average instruction execution time.

TABLE IV

FFT PROGRAM EXECUTION TIME AFTER THE ADDITION
OF THE FLOATING POINT MULTIPLY INSTRUCTION

N       BEST EXECUTION TIME    WORST EXECUTION TIME
        (SEC)                  (SEC)

16      0.058                  0.192
32      0.144                  0.478
64      0.354                  1.156
128     0.833                  2.704
256     1.919                  6.196
512     4.356                  13.979
1024    9.765                  31.150
2048    21.672                 68.719
4096    47.642                 150.291

Using the formula of the timing analysis part of the
model, the performance effects of the addition of the
floating point multiply instruction are as shown in Table
V.

From these results one can see that the improvement
in the MC68000 performance due to the addition of the
floating point multiply instruction, for this specific
application, varies between roughly ten and twenty percent
and is essentially independent of the number of data points
of the Fast Fourier Transform computation.


TABLE V

PERFORMANCE EFFECTS OF THE ADDITION OF THE
FLOATING POINT MULTIPLY INSTRUCTION

N       BEST CASE    WORST CASE
        Perf         Perf

16      1.11         1.22
32      1.11         1.22
64      1.11         1.22
128     1.11         1.22
256     1.10         1.22
512     1.10         1.22
1024    1.10         1.22
2048    1.10         1.22
4096    1.10         1.22
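
The entries of Table V can be related to Tables III and IV with a short Python sketch. It assumes that Perf is the ratio of the execution time before the enhancement to the execution time after it; that reading is an assumption made here (the formula itself is given in the timing analysis part of the model), and the function name is hypothetical.

```python
def performance_ratio(time_before, time_after):
    """Assumed reading of Perf: execution time before / execution time after."""
    return time_before / time_after


# Worst case execution times in seconds, taken from Tables III and IV.
perf_worst_16 = performance_ratio(0.234, 0.192)      # N = 16
perf_worst_512 = performance_ratio(17.042, 13.979)   # N = 512
print(f"{perf_worst_16:.2f} {perf_worst_512:.2f}")
```

Under that assumption both worst case rows give the same value to two decimals, which illustrates why the improvement is nearly independent of N.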

VII.  CONCLUSIONS

This thesis began by identifying and characterizing a new
and controversial type of computer architecture called RISC,
for Reduced Instruction Set Computers. The rise of this new
computer architecture, and the discussions that followed
regarding its performance when RISC machines are compared
with CISC machines, have shown the need for an appropriate
tool to evaluate computer performance from an architectural
point of view.

This thesis suggests a model to be used by computer
architects to determine the performance effects of an
enhancement to a computer architecture. The computer
evaluation process is important, since it generates a
quantified perception of the influence that each enhancement
to the system architecture will have on the system
performance. The availability of a model for computer
performance evaluation is therefore essential in the
decision-making process for determining which architectural
features a system should have to optimize its performance
for a certain application.

To  develop  this  model  for  the  evaluation  of  computer 
performance,  a  conceptual  view  of  what  determines  the  system 
performance  was  formed.  It  is  the  author's  opinion  that  the 
performance  of  a  system  results  from  the  quality  of  the 
match  between  a  particular  application  requirement  and  the 
architectural  characteristics  of  the  system.  This  match  is 
done  through  the  customization  of  the  system  instruction 
set. 

The model that is suggested is divided into two parts.
The first part quantifies the effects that an architectural
enhancement to the system has on the execution time of a
"typical" application program. The second part of the model
compares the efficiency of the design of two systems'
control units. In both parts the model considers that the
application determines the number of times each instruction
of the system instruction set is executed.

For  the  first  part,  the  system  architecture  determines 
the  execution  time  of  each  instruction.  For  the  second  part, 
the  system  architecture  determines  the  number  of  states 
through  which  the  system  control  unit  will  have  to  pass 
during  the  execution  of  the  application  program(s). 

Finally, an example is given of how to use the model to
determine the costs and benefits of adding an instruction
to a processor instruction set for a particular application.

The program that was used to apply the model is a bit
misleading in the quantification of the cost/benefit ratio
of the enhancement. This is because, contrary to what should
be expected, the program does not use all the system
architectural resources and so, even before the addition of
the new instruction, does not optimize the system
performance. If that were not the case, and the program were
an optimal one for the application of interest and for the
processor chosen, then, surely, the enhancement to the
system architecture would have some costs.

In any event, and even considering that the example is a
bit misleading, the author arrived at two criteria, each one
derived from one of the parts of the model, that the
addition of an instruction to a system instruction set has
to satisfy so that the performance of the system for the
particular application is increased.

These two criteria apply if the execution time of the new
instruction is smaller than the execution time of the
sequence of instructions that implemented the function
before the addition of the new instruction to the system.

For the first part of the model, the criterion for the
addition of the new instruction states that:

Nnew  >  Tcost / Tgain

where

Nnew  - is the number of times the new instruction
        is executed for the particular application.
Tgain - is the difference between the execution time
        of the sequence of instructions that had to be
        executed by the system every time the operation
        was performed before the addition of the new
        instruction, and the execution time of the new
        instruction.
Tcost - is the increase in the application program
        execution time that was caused by the
        suppression of some hardware components of
        the processor.

For the second part of the model, the criterion for the
addition of the new instruction states that:

Nnew  >  Es / Ls

where

Ls - represents the gain in the number of control
     unit states, obtained each time the operation
     performed by the instruction and/or the
     sequence is executed.
Es - represents the cost in the number of states
     due to the addition of the new instruction to
     the system instruction set.
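
The two criteria can be written as simple predicates. The Python sketch below is illustrative only: the function names are assumptions, and each inequality is rearranged as a product (Nnew * Tgain > Tcost and Nnew * Ls > Es), which is equivalent when the gain is positive and avoids a division.

```python
def timing_criterion(n_new, t_gain, t_cost):
    """First part of the model: Nnew > Tcost / Tgain (Tgain must be positive)."""
    if t_gain <= 0:
        raise ValueError("the new instruction must be faster than the sequence it replaces")
    return n_new * t_gain > t_cost


def state_criterion(n_new, l_s, e_s):
    """Second part of the model: Nnew > Es / Ls (Ls must be positive)."""
    if l_s <= 0:
        raise ValueError("the new instruction must save control unit states")
    return n_new * l_s > e_s


# FFT example, N = 16, best case: 352 multiplies, Tgain = 177 cycles, Tcost = 0.
print(timing_criterion(352, 177, 0))
# Hypothetical state counts, chosen only to exercise the second criterion.
print(state_criterion(352, 12, 40))
```

With Tcost equal to zero, as in the FFT example, the first criterion holds as soon as the new instruction is executed at least once.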
The  two  parts  of  the  model  need  to  be  thoroughly  checked 
and  confirmed  with  measured  values,    so  that  their  validity 
is  determined. 

APPENDIX A
FAST FOURIER TRANSFORM

FFT     MOVE.W    N,N2            ; N2=N/2
        ASR.W     N2
        MOVE.W    NU,NU1          ; NU1=NU-1
        SUBI.W    #1,NU1
        CLR.W     K               ; K=0
        MOVE.W    NU,D0           ; DO 100 L=1,NU
LOOP1   BEQ.S     100
102     MOVE.W    N2,D1           ; DO 101 I=1,N2
LOOP2   BEQ.S     101
        MOVE.W    NU1,D2          ; P=IBITR(K/2**NU1,NU)
        MOVE.W    K,D3
LOOP3   BEQ.S     200
        ASR.W     #1,D3
        SUBI.W    #1,D2
        BRA       LOOP3
200     MOVE.W    D3,J
        JSR       IBITR
        MOVE.L    RX,P
                                  ; ARG = 6.283185*P/FLOAT(N)
        MOVE.W    N,D3            ; convert N to floating point
        MOVEQ.L   #159,D4
300     ASL       #1,D3
        SUBI.L    #1,D4
        BCC       300
        MOVE.B    #9,D5
        LSR.L     D5,D3
        ROR.L     D5,D4
        ANDI.L    mask,D4         ; clear D4 except exponent
        OR.L      D4,D3           ; D3 <-- FLOAT(N)
        MOVE.L    D3,FPN          ; store FPN
        MOVE.L    P,D3            ; convert P to floating point
        MOVEQ.L   #159,D4
400     ASL       #1,D3
        SUBI.L    #1,D4
        BCC       400
        MOVE.B    #9,D5
        LSR.L     D5,D3
        ROR.L     D5,D4
        ANDI.L    mask,D4         ; clear D4 except exponent
        OR.L      D4,D3           ; D4 <-- FLOAT(P)
        MOVE.L    D3,FPP          ; store FPP
        LEA       FPWR,A2         ; A2 points to Floating Point Working Register
        LEA       FPACC,A1        ; A1 points to Floating Point Accumulator
        LEA       FPP,A0          ; FPWR <-- FPP
        JSR       GETFP
        MOVE.L    #2PI,(A1)       ; FPACC <-- 2PI
        MOVE.B    #2PI,2(A1)
        JSR       MULTFP          ; FPACC <-- 2PI
        LEA       FPN,A0          ; FPWR <-- FPN
        JSR       GETFP
        JSR       DIVFP           ; FPACC <-- 2PI/FPN
        LEA       ARG,A0          ; store ARG
        JSR       STEP
        MOVE.L    ARG,X           ; C=COS(ARG)
        JSR       COSINE
        MOVE.L    RESULT,C        ; store C
        JSR       SINE            ; S=SIN(ARG)
        MOVE.L    RESULT,S        ; store S
        MOVE.W    K,K1            ; K1=K+1
        ADDI.W    #1,K1
        MOVE.W    K1,D3           ; K1N2=K1+N2
        ADD.W     N2,D3
        MOVE.W    D3,K1N2
                                  ; TREAL=XREAL(K1N2)*C+XIMAG(K1N2)*S
        LEA       XREAL,A3
        LEA       XIMAG,A4
        ASL.W     #1,D3           ; D3 <-- 2*K1N2
        SUBI.W    #2,D3           ; D3 <-- 2*K1N2-2
        ADDA.W    D3,A3
        ADDA.W    D3,A4
        MOVEA.L   A3,A0           ; FPWR <-- XREAL(K1N2)
        JSR       GETFP
        MOVE.L    (A2),(A1)       ; FPACC <-- FPWR
        MOVE.B    2(A2),2(A1)
        LEA       C,A0            ; FPWR <-- C
        JSR       GETFP
        JSR       MULTFP          ; FPACC <-- XREAL(K1N2)*C
        LEA       TREAL,A0        ; store partial result
        JSR       STEP
        MOVEA.L   A4,A0           ; FPWR <-- XIMAG(K1N2)
        JSR       GETFP
        MOVE.L    (A2),(A1)       ; FPACC <-- FPWR
        MOVE.B    2(A2),2(A1)
        LEA       S,A0            ; FPWR <-- S
        JSR       GETFP
        JSR       MULTFP          ; FPACC <-- XIMAG(K1N2)*S
        LEA       TREAL,A0        ; FPWR <-- partial TREAL
        JSR       GETFP
        JSR       ADDFP           ; FPACC <-- TREAL
        JSR       STEP            ; store TREAL
                                  ; TIMAG=XIMAG(K1N2)*C-XREAL(K1N2)*S
        MOVEA.L   A3,A0           ; FPWR <-- XREAL(K1N2)
        JSR       GETFP
        MOVE.L    (A2),(A1)       ; FPACC <-- FPWR
        MOVE.B    2(A2),2(A1)
        LEA       S,A0            ; FPWR <-- S
        JSR       GETFP
        JSR       MULTFP          ; FPACC <-- XREAL(K1N2)*S
        LEA       TIMAG,A0        ; store partial result
        JSR       STEP
        EORI.L    mask,(A0)       ; change sign of TIMAG
        MOVEA.L   A4,A0           ; FPWR <-- XIMAG(K1N2)
        JSR       GETFP
        MOVE.L    (A2),(A1)       ; FPACC <-- FPWR
        MOVE.B    2(A2),2(A1)
        LEA       C,A0            ; FPWR <-- C
        JSR       GETFP
        JSR       MULTFP          ; FPACC <-- XIMAG(K1N2)*C
        LEA       TIMAG,A0        ; FPWR <-- partial TIMAG
        JSR       GETFP
        JSR       ADDFP           ; FPACC <-- TIMAG
        JSR       STEP            ; store TIMAG
                                  ; XREAL(K1N2)=XREAL(K1)-TREAL
        EORI      mask,TREAL      ; change sign of TREAL
        MOVE.L    TREAL,(A3)      ; XREAL(K1N2) <-- TREAL
        LEA       XREAL,A5
        MOVE.L    K1,D3
        ASL       #1,D3
        SUBI.L    #2,D3
        ADDA      D3,A5
        MOVEA.L   A5,A0           ; FPWR <-- XREAL(K1)
        JSR       GETFP
        MOVE.L    (A2),(A1)       ; FPACC <-- FPWR
        MOVE.B    2(A2),2(A1)
        MOVEA.L   A3,A0           ; FPWR <-- XREAL(K1N2)
        JSR       GETFP
        JSR       ADDFP           ; FPACC <-- XREAL(K1)-TREAL
        JSR       STEP            ; store
                                  ; XIMAG(K1N2)=XIMAG(K1)-TIMAG
        EORI      mask,TIMAG      ; change sign of TIMAG
        MOVE.L    TIMAG,(A4)      ; XIMAG(K1N2) <-- -TIMAG
        LEA       XIMAG,A6        ; A5 --> XIMAG(K1)
        ADDA.L    D3,A6
        MOVEA.L   A5,A0           ; FPWR <-- XIMAG(K1)
        JSR       GETFP
        MOVE.L    (A2),(A1)       ; FPACC <-- FPWR
        MOVE.B    2(A2),2(A1)
        MOVEA.L   A4,A0           ; FPWR <-- XIMAG(K1N2)
        JSR       GETFP
        JSR       ADDFP           ; FPACC <-- XIMAG(K1N2)
        JSR       STEP            ; store
                                  ; XREAL(K1)=XREAL(K1)+TREAL
        EORI      mask,TREAL      ; change sign of -TREAL
        LEA       TREAL,A0        ; FPWR <-- TREAL
        JSR       GETFP
        MOVE.L    (A2),(A1)       ; FPACC <-- FPWR
        MOVE.B    2(A2),2(A1)
        MOVEA.L   A5,A0           ; FPWR <-- XREAL(K1)
        JSR       GETFP
        JSR       ADDFP           ; FPACC <-- final XREAL(K1)
        JSR       STEP            ; store
                                  ; XIMAG(K1)=XIMAG(K1)+TIMAG
        EORI      mask,TIMAG      ; change sign of -TIMAG
        LEA       TIMAG,A0        ; FPWR <-- TIMAG
        JSR       GETFP
        MOVE.L    (A2),(A1)       ; FPACC <-- FPWR
        MOVE.B    2(A2),2(A1)
        MOVEA.L   A6,A0           ; FPWR <-- partial XIMAG(K1)
        JSR       GETFP
        JSR       ADDFP           ; FPACC <-- final XIMAG(K1)
        JSR       STEP            ; store
        ADDI.W    #1,K            ; K=K+1
        SUBQ.W    #1,D1
        BRA       LOOP2
101     MOVE.W    N2,D1           ; K=K+N2
        ADD.W     K,D1
        MOVE.W    D1,K
        CMP.W     N,D1            ; IF (K.LT.N) GO TO
        BMI       102
        CLR.W     K               ; K=0
        SUBI.W    #1,NU1          ; NU1=NU1-1
        ASR.W     N2              ; N2=N2/2
        SUBQ.W    #1,D0
        BRA       LOOP1
100     MOVE.W    N,D0
        MOVE.W    #1,D1           ; DO 103 K=1,N
LOOP4   BEQ.S     103
        MOVE.W    D1,J            ; I=IBITR(K-1,NU)+1
        SUBI.W    #1,J
        JSR       IBITR
        MOVE.W    RX,I
        ADDI.W    #1,I
        CMP.W     I,D1            ; IF (I.LE.K) GO TO
        BPL       1003
        LEA       XREAL,A3        ; TREAL=XREAL(K)
        LEA       XIMAG,A4
        MOVE.W    D1,D2
        ASR       #1,D2
        SUBI.W    #2,D2
        MOVEA.L   A3,A5
        MOVEA.L   A4,A6
        MOVE.W    I,D3
        ASR       #1,D3
        SUBI      #2,D3
        ADDA.L    D1,A3           ; A3 --> XREAL(K)
        ADDA.L    D1,A5           ; A5 --> XIMAG(K)
        ADDA.L    D2,A4           ; A4 --> XREAL(I)
        ADDA.L    D2,A6           ; A6 --> XIMAG(I)
        MOVE.L    (A3),TREAL
        MOVE.L    (A5),TIMAG      ; TIMAG=XIMAG(K)
        MOVE.L    (A4),(A3)       ; XREAL(K)=XREAL(I)
        MOVE.L    (A6),(A5)       ; XIMAG(K)=XIMAG(I)
        MOVE.L    TREAL,(A4)      ; XREAL(I)=TREAL
        MOVE.L    TIMAG,(A6)      ; XIMAG(I)=TIMAG
1003    ADDQ.W    #1,D1
        SUBQ.W    #1,D0
        BRA       LOOP4
103     RTS                       ; RETURN
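
The Fortran statements carried in the comments of this listing describe an in-place radix-2 transform followed by a bit-reversal reordering. The Python sketch below follows those statements with zero-based indices; it is not a translation of the MC68000 code itself, and the names `fft_sketch` and `ibitr` (the latter mirroring the IBITR function of Appendix B) are assumptions made for illustration.

```python
import math


def ibitr(j, nu):
    """Reverse the nu low-order bits of j."""
    result = 0
    for _ in range(nu):
        result = result * 2 + (j & 1)
        j >>= 1
    return result


def fft_sketch(xreal, ximag):
    """Return the DFT of xreal + i*ximag; the length must be a power of two."""
    n = len(xreal)
    nu = n.bit_length() - 1
    if n == 0 or (1 << nu) != n:
        raise ValueError("length must be a power of two")
    xreal, ximag = list(xreal), list(ximag)
    n2, nu1 = n // 2, nu - 1
    for _ in range(nu):                      # DO 100 L=1,NU
        k = 0
        while k < n:                         # IF (K.LT.N) GO TO 102
            for _ in range(n2):              # DO 101 I=1,N2
                p = ibitr(k >> nu1, nu)      # P=IBITR(K/2**NU1,NU)
                arg = 2.0 * math.pi * p / n
                c, s = math.cos(arg), math.sin(arg)
                k1n2 = k + n2
                treal = xreal[k1n2] * c + ximag[k1n2] * s
                timag = ximag[k1n2] * c - xreal[k1n2] * s
                xreal[k1n2] = xreal[k] - treal
                ximag[k1n2] = ximag[k] - timag
                xreal[k] += treal
                ximag[k] += timag
                k += 1
            k += n2                          # K=K+N2
        nu1 -= 1
        n2 //= 2
    for k in range(n):                       # DO 103 K=1,N (bit-reversal swap)
        i = ibitr(k, nu)
        if i > k:
            xreal[k], xreal[i] = xreal[i], xreal[k]
            ximag[k], ximag[i] = ximag[i], ximag[k]
    return xreal, ximag
```

Each pass of the inner loop performs four multiplications for the butterfly, which together with the argument, sine and cosine evaluations accounts for the multiply counts of Table II.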

APPENDIX B
IBITR FUNCTION

IBITR   MOVEM.L   D0-D3,-(A7)     ; save registers
        MOVE.W    J,J1            ; J1=J
        CLR.W     IBIT            ; IBITR=0
        MOVE.W    NU,D0           ; DO 200 I=1,NU
LOOP    BEQ.S     2000
        MOVE.W    J1,D1           ; J2=J1/2
        ASR.W     #1,D1
        MOVE.W    D1,D2           ; D2 <-- J2
                                  ; IBITR=IBITR*2+(J1
        ASL.W     #1,D2
        MOVE.W    J1,D3
        SUB.W     D2,D3           ; D2 <-- (J1-2*J2)
        ASL       IBIT
        ADD.W     D3,IBIT
        MOVE.W    D1,J1           ; J1=J2
        SUBI      #1,D0
        BRA       LOOP
2000    MOVEM.L   (A7)+,D0-D3     ; restore registers
        RTS                       ; RETURN
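
The loop of this function reverses the NU low-order bits of J. A minimal Python sketch of the same recurrence follows; the variable names mirror the comments of the listing, while the function signature is an assumption made for illustration.

```python
def ibitr(j, nu):
    """Return the value obtained by reversing the nu low-order bits of j."""
    j1 = j
    result = 0
    for _ in range(nu):
        j2 = j1 // 2                           # J2 = J1/2
        result = result * 2 + (j1 - 2 * j2)    # shift in the low bit of J1
        j1 = j2                                # J1 = J2
    return result


# Bit-reversed ordering of the indices 0..7 for an 8-point transform (NU = 3).
print([ibitr(k, 3) for k in range(8)])
```

Applying the function twice with the same `nu` returns the original index, which is why the reordering loop of Appendix A only swaps a pair when I is greater than K.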


APPENDIX C
SINE FUNCTION

SINE    MOVEM.L   D0-D4,-(A7)     ; save registers
        MOVE.L    X,D0
        BTST.L    #bit,X          ; test sign of X
        BNE       100
        MOVE.B    #-1,SGN         ; SGN <-- -1
        BCHG      #bit,D0         ; D0 <-- -D0
        BRA       200
100     MOVE.B    #1,SGN          ; SGN <-- 1
200     MOVE.L    D0,Y            ; Y <-- D0
        CMP.L     YMAX,D0         ; YMAX - D0
        BPL       300
        error message
300     MOVEA.L   Y,A0            ; A0 --> Y
        JSR       GETFP           ; FPWR <-- Y
        MOVE.L    1/PI,(A1)       ; FPACC <-- inverse of pi
        MOVE.B    1/PI,2(A1)
        JSR       MULTFP          ; FPACC <-- Y/PI
        MOVEA.L   Y/PI,A0         ; A0 --> Y/PI
        JSR       STEP            ; store Y/PI
        MOVE.L    Y/PI,D1         ; D1 <-- Y/PI
        MOVE.L    D1,D2
        ANDI.L    mask,D1         ; D1 <-- mantissa
        BSET      #bit,D1         ; insert hidden bit
        LSR       #7,D2           ; hi D2 has exponent
        SWAP      D2              ; lo D2 has exponent
        SUBI.B    #127,D2         ; extract bias
        BPL       400             ; if positive go to 400
        MOVE.W    #0,N            ; clear N
        BRA       500
400     BNE       600             ; if zero go to 600
        MOVE.W    #1,N            ; N <-- 1
        BRA       500
600     ASL.L     D2,D1           ; shift left mantissa by exponent value, max = 8
        ANDI      mask,D1         ; leave only integer part
        ASR.L     #7,D1
        SWAP      D1              ; mantissa in lo D1
        MOVE.W    D1,N            ; N <-- integer of mantissa
500     MOVE.L    Y/PI,XN         ; XN <-- FLOAT(N)
        BTST.B    #0,N            ; N even ?
        BEQ       700             ; if even do nothing
                                  ; otherwise
        BCHG      #7,SGN          ; change sign of SGN
                                  ; determine F
700     MOVE.L    X,|X|
        ANDI      mask,|X|        ; clear sign bit
        MOVEA.L   XN,A0           ; FPWR <-- XN
        JSR       GETFP
        MOVE.L    -C1,(A1)        ; FPACC <-- C1
        MOVE.B    -C1,2(A1)
        JSR       MULTFP          ; FPACC <-- -(XN*C1)
        MOVEA.L   |X|,A0          ; FPWR <-- |X|
        JSR       GETFP
        JSR       ADDFP           ; FPACC <-- |X|-(XN*C1)
        MOVEA.L   TEMP,A0         ; store FPACC
        JSR       STEP
        MOVEA.L   XN,A0           ; FPWR <-- XN
        JSR       GETFP
        MOVE.L    -C2,(A1)        ; FPACC <-- -C2
        MOVE.B    -C2,2(A1)
        JSR       MULTFP          ; FPACC <-- -(XN*C2)
        MOVEA.L   TEMP,A0         ; FPWR <-- |X|-(XN*C1)
        JSR       GETFP
        JSR       ADDFP           ; FPACC <-- F
        MOVEA.L   F,A0            ; store F
        JSR       STEP
        MOVE.L    F,|F|           ; |F| <-- F
        ANDI.L    mask,|F|        ; clear sign bit
        CMPI.L    |F|,#eps        ; |F| - eps
        BMI       800             ; branch if |F| < eps
                                  ; otherwise determine R(g)
        MOVEA.L   F,A0            ; FPWR <-- F
        JSR       GETFP
        MOVE.L    (A2),(A1)       ; FPACC <-- F
        MOVE.B    2(A2),2(A1)
        JSR       MULTFP          ; FPACC <-- F*F
                                  ; G = F*F
        MOVE.L    (A1),(A2)       ; FPWR <-- G
        MOVE.B    2(A1),2(A2)
        MOVE.L    R4,(A1)         ; FPACC <-- r4
        MOVE.B    R4,2(A1)
        JSR       MULTFP          ; FPACC <-- r4*G
        MOVEA.L   G,A0            ; store G
        JSR       STEP
        MOVE.L    R3,(A2)         ; FPWR <-- r3
        MOVE.B    R3,2(A2)
        JSR       ADDFP           ; FPACC <-- r4*G+r3
        MOVEA.L   G,A0            ; FPWR <-- G
        JSR       GETFP
        JSR       MULTFP          ; FPACC <-- (r4*G+r3)*G
        MOVE.L    R2,(A2)         ; FPWR <-- r2
        MOVE.B    R2,2(A2)
        JSR       ADDFP           ; FPACC <-- (r4*G+r3)*G+r2
        MOVEA.L   G,A0            ; FPWR <-- G
        JSR       GETFP
        JSR       MULTFP          ; FPACC <-- ((  )*G+r2)*G
        MOVE.L    R1,(A2)         ; FPWR <-- r1
        MOVE.B    R1,2(A2)
        JSR       ADDFP           ; FPACC <-- (  )*G+r1
        MOVEA.L   G,A0            ; FPWR <-- G
        JSR       GETFP
        JSR       MULTFP          ; FPACC <-- R(g)
        MOVEA.L   F,A0            ; FPWR <-- F
        JSR       GETFP
        JSR       MULTFP          ; FPACC <-- F*R(g)
        JSR       ADDFP           ; FPACC <-- F*R(g)+F
        MOVEA.L   RESULT,A0       ; store result
        JSR       STEP
        BRA       900
800     MOVE.L    F,RESULT        ; result <-- F
900     MOVE.B    SGN,D3          ; test value of SGN
        BPL       DONE            ; if positive do nothing
                                  ; otherwise
        MOVE.L    RESULT,D4       ; change sign of result
        BCHG      #31,D4
        MOVE.L    D4,RESULT
DONE    MOVEM.L   (A7)+,D0-D4     ; restore registers
        RTS                       ; return to main program
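
The comments of this listing outline a range reduction followed by a polynomial evaluation: the argument is reduced by a multiple N of pi in two steps (constants C1 and C2), and the result is F + F*R(g) with g = F*F. The Python sketch below shows that scheme under loudly stated assumptions: the listing does not give the numerical values of C1, C2, r1..r4 or eps, so the values used here are the single precision constants commonly quoted for this kind of sine routine, and N is taken as the nearest integer to |X|/pi. All names are hypothetical.

```python
import math

# Assumed constants (not given in the listing): C1 + C2 approximates pi.
C1 = 3.140625
C2 = 9.6765358979e-4
# Assumed polynomial coefficients r1..r4 and threshold eps.
R1, R2, R3, R4 = -0.1666665668, 0.8333025139e-2, -0.1980741872e-3, 0.2601903036e-5
EPS = 2.44e-4


def sine_sketch(x):
    """Approximate sin(x) by range reduction and a polynomial in g = f*f."""
    sgn = -1.0 if x < 0.0 else 1.0
    y = abs(x)
    n = int(y / math.pi + 0.5)        # assumption: nearest integer to Y/PI
    xn = float(n)
    if n & 1:                         # N odd: change sign of SGN
        sgn = -sgn
    f = (y - xn * C1) - xn * C2       # F = |X| - XN*C1 - XN*C2
    if abs(f) < EPS:
        result = f                    # result <-- F
    else:
        g = f * f
        r = (((R4 * g + R3) * g + R2) * g + R1) * g   # R(g)
        result = f + f * r            # F*R(g) + F
    return sgn * result


print(sine_sketch(1.0), math.sin(1.0))
```

Splitting pi into C1 and C2 keeps the first subtraction exact for moderate N, which limits the loss of accuracy in the reduced argument F.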

APPENDIX D
COSINE FUNCTION

COSINE  MOVEM.L   D0-D4,-(A7)     ; save registers
        MOVE.B    #1,SGN          ; SGN <-- 1
        MOVE.L    X,|X|           ; |X| <-- X
        ANDI      mask,|X|        ; clear sign bit
        MOVEA.L   |X|,A0          ; FPWR <-- |X|
        JSR       GETFP
        MOVE.L    PI/2,(A1)       ; FPACC <-- PI/2
        MOVE.B    PI/2,2(A1)
        JSR       ADDFP           ; FPACC <-- |X|+PI/2
        MOVEA.L   Y,A0            ; store Y
        JSR       STEP
        MOVE.L    Y,D0            ; D0 <-- Y
        CMP.L     YMAX,D0         ; YMAX - D0
        BPL       100
        error message
100     MOVEA.L   Y,A0            ; A0 --> Y
        JSR       GETFP           ; FPWR <-- Y
        MOVE.L    1/PI,(A1)       ; FPACC <-- inverse of pi
        MOVE.B    1/PI,2(A1)
        JSR       MULTFP          ; FPACC <-- Y/PI
        MOVEA.L   Y/PI,A0         ; A0 --> Y/PI
        JSR       STEP            ; store Y/PI
        MOVE.L    Y/PI,D1         ; D1 <-- Y/PI
        MOVE.L    D1,D2
        ANDI.L    mask,D1         ; D1 <-- mantissa
        BSET      #bit,D1         ; insert hidden bit
        LSR       #7,D2           ; hi D2 has exponent
        SWAP      D2              ; lo D2 has exponent
        SUBI.B    #127,D2         ; extract bias
        BPL       200             ; if positive go to 200
        MOVE.W    #0,N            ; clear N
        BRA       300
200     BNE       400             ; if zero go to 400
        MOVE.W    #1,N            ; N <-- 1
        BRA       300
400     ASL.L     D2,D1           ; shift left mantissa by exponent value, max = 8
        ANDI      mask,D1         ; leave only integer part
        ASR.L     #7,D1
        SWAP      D1              ; mantissa in lo D1
        MOVE.W    D1,N            ; N <-- integer of mantissa
300     MOVE.L    Y/PI,XN         ; XN <-- FLOAT(N)
        BTST.B    #0,N            ; N even ?
        BEQ       500             ; if even do nothing
                                  ; otherwise
        BCHG      #7,SGN          ; change sign of SGN
500     MOVEA.L   XN,A0           ; FPWR <-- XN
        JSR       GETFP
        MOVE.L    #-.5,(A1)       ; FPACC <-- .5
        MOVE.B    #-.5,2(A1)
        JSR       ADDFP           ; FPACC <-- XN-.5
        JSR       STEP            ; store XN
                                  ; determine F
        MOVEA.L   XN,A0           ; FPWR <-- XN
        JSR       GETFP
        MOVE.L    -C1,(A1)        ; FPACC <-- C1
        MOVE.B    -C1,2(A1)
        JSR       MULTFP          ; FPACC <-- -(XN*C1)
        MOVEA.L   |X|,A0          ; FPWR <-- |X|
        JSR       GETFP
        JSR       ADDFP           ; FPACC <-- |X|-(XN*C1)
        MOVEA.L   TEMP,A0         ; store FPACC
        JSR       STEP
        MOVEA.L   XN,A0           ; FPWR <-- XN
        JSR       GETFP
        MOVE.L    -C2,(A1)        ; FPACC <-- -C2
        MOVE.B    -C2,2(A1)
        JSR       MULTFP          ; FPACC <-- -(XN*C2)
        MOVEA.L   TEMP,A0         ; FPWR <-- |X|-(XN*C1)
        JSR       GETFP
        JSR       ADDFP           ; FPACC <-- F
        MOVEA.L   F,A0            ; store F
        JSR       STEP
        MOVE.L    F,|F|           ; |F| <-- F
        ANDI.L    mask,|F|        ; clear sign bit
        CMPI.L    |F|,#eps        ; |F| - eps
        BMI       600             ; branch if |F| < eps
                                  ; otherwise determine R(g)
        MOVEA.L   F,A0            ; FPWR <-- F
        JSR       GETFP
        MOVE.L    (A2),(A1)       ; FPACC <-- F
        MOVE.B    2(A2),2(A1)
        JSR       MULTFP          ; FPACC <-- F*F
                                  ; G = F*F
        MOVE.L    (A1),(A2)       ; FPWR <-- G
        MOVE.B    2(A1),2(A2)
        MOVE.L    R4,(A1)         ; FPACC <-- r4
        MOVE.B    R4,2(A1)
        JSR       MULTFP          ; FPACC <-- r4*G
        MOVEA.L   G,A0            ; store G
        JSR       STEP
        MOVE.L    R3,(A2)         ; FPWR <-- r3
        MOVE.B    R3,2(A2)
        JSR       ADDFP           ; FPACC <-- r4*G+r3
        MOVEA.L   G,A0            ; FPWR <-- G
        JSR       GETFP
        JSR       MULTFP          ; FPACC <-- (r4*G+r
        MOVE.L    R2,(A2)         ; FPWR <-- r2
        MOVE.B    R2,2(A2)
        JSR       ADDFP           ; FPACC <-- (r4*G+r3)*G+r2
        MOVEA.L   G,A0            ; FPWR <-- G
        JSR       GETFP
        JSR       MULTFP          ; FPACC <-- ((  )*G+r2)*G
        MOVE.L    R1,(A2)         ; FPWR <-- r1
        MOVE.B    R1,2(A2)
        JSR       ADDFP           ; FPACC <-- (  )*G+r1
        MOVEA.L   G,A0            ; FPWR <-- G
        JSR       GETFP
        JSR       MULTFP          ; FPACC <-- R(g)
        MOVEA.L   F,A0            ; FPWR <-- F
        JSR       GETFP
        JSR       MULTFP          ; FPACC <-- F*R(g)
        JSR       ADDFP           ; FPACC <-- F*R(g)+F
        MOVEA.L   RESULT,A0       ; store result
        JSR       STEP
        BRA       700
600     MOVE.L    F,RESULT        ; result <-- F
700     MOVE.B    SGN,D3          ; test value of SGN
        BPL       DONE            ; if positive do nothing
                                  ; otherwise
        MOVE.L    RESULT,D4       ; change sign of result
        BCHG      #31,D4
        MOVE.L    D4,RESULT
DONE    MOVEM.L   (A7)+,D0-D4     ; restore registers
        RTS                       ; return to main program